Inside the SQL Server Query Optimizer - part 3 Cardinality estimation

Hi guys!

After the simplification phase discussed in the last article, today we will speak about another fundamental step called Cardinality Estimation.

During this step the optimizer try to predict the number of rows returned by each node of the execution plan.
As you can imagine this is a step of primary importance because a good prediction generate an accurate execution plan with a lower processing cost to execute.

The statistics

In order to estimate the number of rows returned by a query the Cardinality Estimator use so called statistics.

By default for each column and each index created, SQL Server create it's relative statistic.

But what is a statistic?

A statististic for query optimization is a binary large object called BLOBs that contain statistical information about the distribution of values for one o more columns of a table.

We can observe statistics through the T-SQL command:

       
DBCC SHOW_STATISTICS ( <TABLE> , <FIELD_NAME> )

The execution of this command will show you a grid with a bunch of rows called histogram statistics.

How to read the Histogram above?

For a value in the RANGE_HI_KEY columns (eg."214230003200000") the EQ_ROWS will tell your how much rows there are (eg. 4 rows).

Single Predicate (the simplest case)

Now, we will start with the simplest case.
Estimate the number of rows qualified returned by a single query predicate.
This is an easy case for the Optimizer.

Suppose to execute this following simple query:

       
SELECT * FROM <TABLE> WHERE FIELD_NAME = '214230003200000'

Looking at the execution plan, using the statistics, the optimizer correctly indicate an estimated Number of Rows equal to 4.

Multiple predicates

As we have just seen, if we have a Query where the WHERE clause has only one predicate everything is simple!

If instead we have more than one predicate? ...what happens?

Suppose having two predicate, we can go back to two extreme cases:

One of the two sets generated by the (two) predicates, considered individually, overlaps the other.
The two sets are disjoint.

Taking into consideration for the moment only the case in which the two predicates are connected by an AND operator

In the first case the result will be the cardinality of the part in common of the two set, while in the second case the result will be the empty set.

What is not known is the cardinality of the elements in common to the two sets.

For this reason, it is also easy to understand that having only statistics for single fields it is not possible to calculate an exact cardinality value.

As we will see below, there are formulas that allow us to estimate this value!

An example with two predicates connected by an AND operator

Given a PERSON table with these fields:

Suppose to execute this Query with two conditions in the predicate:

       
Select CAP, PIVA from Person where (CAP BETWEEN '10000' AND '400000') AND  PIVA = '00759680218'

A) When I search the CAP field (CAP BETWEEN '10000' AND '400000')

Using statistics, estimated number of row is equal to 111

B) When I search the PIVA field (PIVA = '00759680218')

Using statistics, estimated number of row is equal to 13

And now? ...what is the number of rows estimated when the two conditions are placed in AND?

Well, before SQL Server 2014 the two conditions are considerated independent among them.

So the Optimizer apply this formula:

So, ( 13 * 111 ) / 512 = 2,818

Sql Server 2014 introduce a new cardinality estimator (CE) who assume some correlation between the predicates.

New CE adopt a new formula called exponential backoff algorithm:

In this case we have:

So, 512 * ( 13 / 512) * SQRT(111 / 512) = 6,052

And actually:

And now another question!

What cardinality estimator produced the better rows number estimation?

OLD CE = 2,8
NEW CE = 6,4

Our Query return ….

…. 13 Rows!

In this case the NEW CE produce a better estimation for this Query but in general it depends:

If you have data with correlating predicates the new CE is likely to give you a better estimate and therefore a better plan.
However, if your predicates have no correlation, you may find the legacy CE to provide a better estimate.

That's all for today!
I hope you liked this article.
Look forward to the next part of this series of articles

Luca Biondi @ SQLServerPerformance blog!

Next post: SQL Server, Again info about statistics and the "Ascending Key Problem"

Previous post: Inside the SQL Server Query Optimizer - part 2 All about the Simplification

Search This Blog

SQL Server Performance Blog where milliseconds matter