Sql Server: The execution plan and the cost of the operators inside it. The clustered index scan operator. Part 1.

Hi friends!


Welcome back, today we will not continue to talk about the parser part of SQL Server as in the previous post, we will do it in the next post!

Today we start a short new series of posts that talks about the Execution plan and how to calculate the cost of the operators inside it.

Are you ready? Three, two, one ...go!

 

A bit of theory

One of the less explored part of SQL Server is the logic contained inside the optimizer.

We already talked about the optimizer in greater detail here

It's goal is to find a good execution plan in order to resolve our Query.
For example, is the optimizer that chooses which type of JOIN to use to resolve the T-SQL command.

So you can follow me step by step, i have prepared this example for you

       
Create table table1 (id int identity (1,1), codice varchar(40), descr varchar(80))
Create table table2 (id int identity (1,1), codice varchar(40), descr varchar(80))
Create table table3 (id int identity (1,1), codice varchar(40), descr varchar(80)) 
Create table table4 (id int identity (1,1), codice varchar(40), descr varchar(80))

Create clustered index idx_clustered_table1_id on table1(id)
Create clustered index idx_clustered_table2_id on table2(id)
Create clustered index idx_clustered_table3_id on table3(id)
Create clustered index idx_clustered_table4_id on table4(id)

Insert into table1 (codice,descr) select top 1000 name as codice, '' as descr from sys.columns
Insert into table2 (codice,descr) select top 171 name as codice, '' as descr from sys.columns
Insert into table3 (codice,descr) select top 170 name as codice, '' as descr from sys.columns    
       
 


Note: We have create 4 tables. Inside the table1 we have 1000 rows, in table2 171 rows, in table3 170 rows while in table4 only 1 row.

Now, run this two queries.

       
 
Select a1.id,a2.id from table1 a1    -- hash match
   join table2 a2 on a1.id = a2.id
        
Select a1.id,a2.id from table1 a1   -- inner join 
  join table3 a2 on a1.id = a2.id
       
 


What can you observe looking by looking to the execution plan? 

Hei! For the same query the optimizer chooses two different JOIN Operator: An Hash Match and A Nested Loop.

 


Just a little note: note that in both cases the table with less rows is choosen as "start" table.

Ok, let me say it again:

For the first Query an hash match operator is choosen.
For the second Query a nested loop is choosen.

But why?

The Optimizer search for an execution plan where the global cost is minimum cost and in order to assign a cost for each operator use various parameters: one of the parameters used is the number of rows.  (Table2 have 171 rows while Table2 170)

For the sake of clarity: You should keep in mind that this is a very simple example where we don't have any filter (WHERE) clause and so the entire table is read.
Normally the optimizer use statistics to get the number of row to be joined. But we will talk about more later.

In order to respond to the request you need to read the next paragraph where we start to compute the cost of each operator included in the execution plan.

 

How to calculate the cost of an execution plan!


Each operator inside the execution plan have a cost given by the sum of the CPU cost and the I/O cost.

The Clustered Index Scan Operator cost

Question: What is the cost of the Clustered Index Scan used to retrieve data from Table2? Our 170 rows..

fig.1

Answer: We have a I/O cost of 0,003215 and CPU cost of 0,000344

Yes but where theese number come from? How to calcolate them?

 

I/O Cost

Trying to model the I/O cost should be logical assume that:

  • It is proportional to the number of pages read
  • It is the sum of two parts: a random access I/O cost and and sequential I/O cost.


We can try this formula:

I/O COST = C + K * (Npages-1)

where:

  • C is a constant that take care of the random access time part
  • K is a constant and NPages are the number of pages to be read (sequential part)


The last info we need is how to get the number of pages read

In our case is more simple due to the fact that we read the whole table (since we have no index at all) and so we can use the command:

dbcc showcontig (Table) with tableresults

In our case "dbcc showcontig ('Table2')" return 1 page.

fig.2

 

0,003125 = C + K * (1 -1)

Finally we need to take another misure executing the same select with the table having a different number of rows. This is a simple matematical fact: To resolve a linear equation with 2 incognite we need 2 equations.

0,0053472 = C + K * (4-1)

Let's do some easy mathematic to find C and K values

C = 0,003125

K = (0,0053472 - 0,003125) / 3 = 7,4073 x 10^-4 = 0,00074073


These are the first two magic number.

The I/O cost for a Clustered Index Scan operator formula is:

I/O Cost = 0,003215 + 0,00074073 * (npages-1)

 

Note that: 

  • 0,00325 = 1 / 320 s = 320 IOPS where each I/O fetch an 8 KB page from disk
  • 0,00074074 = 1 / 1350 s = 1350 pages/sec.


The CPU Cost

And now for the CPU cost for the same operator.

While the I/O cost is proportional to the number of pages to be read, the CPU cost is legated to the number of row.

For the cost model we can suppose a model like this:

CPU Cost = A + B * Nrows

Where:

A is a constant that take care of the CPU cycle during the initialization of the procedure

B is a costant that describe the number of CPU cycle needed for each row.

 

Now running a select from "Table4" and then from Table1 we have that

0,0001581 = A + B * 1

0,001257 = A + B * 1000

doing some math..

A = 0,000157

B = 0,0000011

so the formula for the CPU Cost of a clustered index is:

CPU Cost =  0,000157 + 0,0000011 * NRow


Theese A and B are two, so called, magic number!

 

In the next part...


In the next part we will see the costs of other operators, for example the JOIN operator Hash Match and Nested Loop


Finally, after having seen the costs of all the operators we will sum up ..

But for now (in order not to put too much meat on the fire) we stop here!
So, wait for you in the comments for any doubt, question or curiosity.

Luca


Comments

I Post più popolari

SQL Server, datetime vs. datetime2

SQL Server, execution plan and the lazy spool (clearly explained)

La clausola NOLOCK. Approfondiamo e facciamo chiarezza!