Fastest way to perform an "Insert if not exists" operation

Hi guys,

As many of you know this blog was born with the main intent of talking about optimizations and performance.

Today we see what is the fastest way to perform an operation that in reality often happens to perform.

Let’s talk about "Insert if not exists" or insert new records in a table only if they are not already present.
Ready to measure with your stopwatch in hand?

Enjoy the reading!

The "Insert if not exists" operation

Suppose you have a table which contains a list of all the products in our store whos name is Products

Suppose you have another table NewProducts, with the same structure, which also contains a list of products.

What we want to do is to insert in the Products table all the products found in the NewProducts table that do not exist in the Products table.

This is the structure of the two tables:


CREATE TABLE PRODUCTS (ID INT IDENTITY(1,1), CODE VARCHAR(20) PRIMARY KEY CLUSTERED ([ID] ASC ))

CREATE TABLE NEWPRODUCTS (ID INT IDENTITY(1,1), CODE VARCHAR(20) PRIMARY KEY CLUSTERED ([ID] ASC ))

The field we look at to establish if a row is already present or not is the CODE field

We have basically 4 ways to do this "Insert if not exists" operation.

1) NOT EXISTS

The first way to implement what is required is using the NOT EXISTS operator.

We can write the following statement:


INSERT INTO PRODUCTS(CODE)
  SELECT N.CODE FROM NEWPRODUCTS N
    WHERE NOT EXISTS (SELECT 1 FROM PRODUCTS P WHERE P.CODE = N.CODE)

2) LEFT OUTER JOIN

We can also obtain the same result using a LEFT OUTER JOIN.


INSERT INTO PRODUCTS(CODE)
  SELECT N.CODE 
  FROM NEWPRODUCTS N
  LEFT OUTER JOIN PRODUCTS P ON P.CODE = N.CODE
  WHERE P.ID IS NULL

3) MERGE

Another way to obtain the result is to use the MERGE command.

This command allows you to do at the same time insert, udpate and delete and for this reason is very interesting.


INSERT INTO PRODUCTS(CODE)
  SELECT N.CODE 
  FROM NEWPRODUCTS N
  LEFT OUTER JOIN PRODUCTS P ON P.CODE = N.CODE
  WHERE P.ID IS NULL

4) EXCEPT

The last way is to use the EXCEPT operator. This is the most elegant solution.


INSERT INTO PRODUCTS(CODE)
  SELECT CODE FROM NEWPRODUCTS
  EXCEPT
  SELECT CODE FROM PRODUCTS

The Question is?

Will all of these solutions have the same results? Absolutely NOT!

So let’s run a nice benchmark to figure out which solution is best to use.
In the following paragraph we see the script used to run the benchmark ... this in case you also wanted to try...

The benchmark procedure

I wrote the following procedure PREPARE_TEST used to create and fill our two tables with data.

Every time I run the procedure I drop the two tables, I create them, I fill them with data and finally I execute the command DBCC FreeprocCache.


CREATE PROCEDURE PREPARE_TEST (@N int, @overlap int) AS
BEGIN
    SET STATISTICS IO OFF

	DROP TABLE PRODUCTS
	CREATE TABLE PRODUCTS (ID INT IDENTITY(1,1), CODE VARCHAR(20) PRIMARY KEY CLUSTERED ([ID] ASC ))

	;WITH U AS (
		SELECT 1 AS N
		UNION ALL
		SELECT N + 1
		FROM U
		WHERE N < @N
	)
	INSERT INTO PRODUCTS (CODE)
	SELECT
	  RIGHT( '00000000000000000000' + CAST( ROW_NUMBER() OVER ( ORDER BY N) AS VARCHAR(20)) , 20) AS CODE 
	FROM U
	OPTION(MAXRECURSION 0);

	DROP TABLE NEWPRODUCTS
	CREATE TABLE NEWPRODUCTS (ID INT IDENTITY(1,1), CODE VARCHAR(20) PRIMARY KEY CLUSTERED ([ID] ASC ))

	;WITH U AS (
		SELECT 1 AS N
		UNION ALL
		SELECT N + 1
		FROM U
		WHERE N < @N
	)
	INSERT INTO NEWPRODUCTS (CODE)
	SELECT
	  RIGHT( '00000000000000000000' + CAST( @overlap + ROW_NUMBER() OVER ( ORDER BY N) AS VARCHAR(20)) , 20) AS CODE 
	FROM U
	OPTION(MAXRECURSION 0);

	DBCC FREEPROCCACHE
END

I can choose how many rows the tables will contain by specifying the parameter N and I can choose also how many rows are missing by specifying the parameter N overlap.

In the following picture we can see the content of the two table.

I use an equal overlap value to make 10 so the product table will always be missing 10 rows

Once data structure are prepared and filled with data I wrote 4 procedure that perform the insert using the 4 way seen before.

The four procedure are named RUN_TEST_NOT_EXISTS, RUN_TEST_LEFT_OUTER_JOIN, RUN_TEST_MERGE and RUN_TEST_EXCEPT.

Duration measured is the sum of 10 run of each procedure. Every time tables are dropped, ricreated and filler with data.

The Results

As said, I created four procedures and then for each one of them I run it with an increasing number or rows inside the table. Tests are done with the parameter N equal to 1K rows, 10K rows, 100K Rows, 1M Rows and 10M Rows.

The Duration is expressed in microsecond.

These are the results:

I think the following graphic is very clear.

Globally the best way is the NOT EXISTS at the first place and LEFT OUTER JOIN at the second place.

When the rows are few (1K) (in blue) all the ways are good, but the MERGE starts to take twice the time.

At 10K rows the winner is the NOT EXISTS.

At 100K rows and at 1M i got some strange result that i must indagate!

When Rows reach 10M of rows result is clear: NOT EXISTS at the first place followed by the LEFT OUTER JOIN

Avoid the MERGE and the EXCEPT when the number of rows are higher!

Conclusion

What emerges from this test is that for the realization of our operation of "Insert If Not Exists" there are more ways but in case tables that contain a lot of data it is better to move towards the solution that uses the NOT EXISTS or the LEFT OUTER JOIN.

The “Not Exists” solution is globally the fastest solution!

If someone wants to perform the same tests we could share the results willingly!

That's all for today, have a nice weekend and stay tuned and ....

~Luke

Previous post: NEWS! SQL Server 2022 (GA) is OUT!

Search This Blog

SQL Server Performance Blog where milliseconds matter