Fastest way to perform an "Insert if not exists" operation
Hi guys,
As many of you know this blog was born with the main intent of talking about optimizations and performance.
Today we see what is the fastest way to perform an operation that in reality often happens to perform.
Let’s talk about "Insert if not exists" or insert new records in a table only if they are not already present.
Ready to measure with your stopwatch in hand?
Enjoy the reading!
The "Insert if not exists" operation
Suppose you have a table which contains a list of all the products in our store whos name is Products
Suppose you have another table NewProducts, with the same structure, which also contains a list of products.
What we want to do is to insert in the Products table all the products found in the NewProducts table that do not exist in the Products table.
This is the structure of the two tables:
CREATE TABLE PRODUCTS (ID INT IDENTITY(1,1), CODE VARCHAR(20) PRIMARY KEY CLUSTERED ([ID] ASC ))
CREATE TABLE NEWPRODUCTS (ID INT IDENTITY(1,1), CODE VARCHAR(20) PRIMARY KEY CLUSTERED ([ID] ASC ))
The field we look at to establish if a row is already present or not is the CODE field
We have basically 4 ways to do this "Insert if not exists" operation.
1) NOT EXISTS
INSERT INTO PRODUCTS(CODE)
SELECT N.CODE FROM NEWPRODUCTS N
WHERE NOT EXISTS (SELECT 1 FROM PRODUCTS P WHERE P.CODE = N.CODE)
2) LEFT OUTER JOIN
We can also obtain the same result using a LEFT OUTER JOIN.
INSERT INTO PRODUCTS(CODE)
SELECT N.CODE
FROM NEWPRODUCTS N
LEFT OUTER JOIN PRODUCTS P ON P.CODE = N.CODE
WHERE P.ID IS NULL
3) MERGE
INSERT INTO PRODUCTS(CODE)
SELECT N.CODE
FROM NEWPRODUCTS N
LEFT OUTER JOIN PRODUCTS P ON P.CODE = N.CODE
WHERE P.ID IS NULL
4) EXCEPT
The last way is to use the EXCEPT operator. This is the most elegant solution.
INSERT INTO PRODUCTS(CODE)
SELECT CODE FROM NEWPRODUCTS
EXCEPT
SELECT CODE FROM PRODUCTS
The Question is?
Will all of these solutions have the same results? Absolutely NOT!
So let’s run a nice benchmark to figure out which solution is best to use.
In the following paragraph we see the script used to run the benchmark ... this in case you also wanted to try...
The benchmark procedure
I wrote the following procedure PREPARE_TEST used to create and fill our two tables with data.
Every time I run the procedure I drop the two tables, I create them, I fill them with data and finally I execute the command DBCC FreeprocCache.
CREATE PROCEDURE PREPARE_TEST (@N int, @overlap int) AS BEGIN SET STATISTICS IO OFF DROP TABLE PRODUCTS CREATE TABLE PRODUCTS (ID INT IDENTITY(1,1), CODE VARCHAR(20) PRIMARY KEY CLUSTERED ([ID] ASC )) ;WITH U AS ( SELECT 1 AS N UNION ALL SELECT N + 1 FROM U WHERE N < @N ) INSERT INTO PRODUCTS (CODE) SELECT RIGHT( '00000000000000000000' + CAST( ROW_NUMBER() OVER ( ORDER BY N) AS VARCHAR(20)) , 20) AS CODE FROM U OPTION(MAXRECURSION 0); DROP TABLE NEWPRODUCTS CREATE TABLE NEWPRODUCTS (ID INT IDENTITY(1,1), CODE VARCHAR(20) PRIMARY KEY CLUSTERED ([ID] ASC )) ;WITH U AS ( SELECT 1 AS N UNION ALL SELECT N + 1 FROM U WHERE N < @N ) INSERT INTO NEWPRODUCTS (CODE) SELECT RIGHT( '00000000000000000000' + CAST( @overlap + ROW_NUMBER() OVER ( ORDER BY N) AS VARCHAR(20)) , 20) AS CODE FROM U OPTION(MAXRECURSION 0); DBCC FREEPROCCACHE END
I can choose how many rows the tables will contain by specifying the parameter N and I can choose also how many rows are missing by specifying the parameter N overlap.
Once data structure are prepared and filled with data I wrote 4 procedure that perform the insert using the 4 way seen before.
The four procedure are named RUN_TEST_NOT_EXISTS, RUN_TEST_LEFT_OUTER_JOIN, RUN_TEST_MERGE and RUN_TEST_EXCEPT.
Duration measured is the sum of 10 run of each procedure. Every time tables are dropped, ricreated and filler with data.
The Results
As said, I created four procedures and then for each one of them I run it with an increasing number or rows inside the table. Tests are done with the parameter N equal to 1K rows, 10K rows, 100K Rows, 1M Rows and 10M Rows.
The Duration is expressed in microsecond.
These are the results:
I think the following graphic is very clear.
When the rows are few (1K) (in blue) all the ways are good, but the MERGE starts to take twice the time.
At 10K rows the winner is the NOT EXISTS.
At 100K rows and at 1M i got some strange result that i must indagate!
When Rows reach 10M of rows result is clear: NOT EXISTS at the first place followed by the LEFT OUTER JOIN
Avoid the MERGE and the EXCEPT when the number of rows are higher!
Conclusion
What emerges from this test is that for the realization of our operation of "Insert If Not Exists" there are more ways but in case tables that contain a lot of data it is better to move towards the solution that uses the NOT EXISTS or the LEFT OUTER JOIN.
The “Not Exists” solution is globally the fastest solution!
If someone wants to perform the same tests we could share the results willingly!
Previous post: NEWS! SQL Server 2022 (GA) is OUT!
Comments
Post a Comment