A bit of theory of databases: The Halloween problem and the Table Spool
Hi Guys,
Welcome to this second post of the year.
Today we will talk a little bit of theory, we will talk a little bit of database theory in general and therefore not specifically of our SQL Server.
Enjoy yourselves!
The Halloween problem
I don't know about you but "Halloween problem" makes me think of one of those old fairy tales that always hide a wise warning, one of those fairy tales lost in a past time which in our case is however the period in which relational databases were born and started to develop ... but let me tell you the whole story!
This story begins around the mid-1970s.
Ted Codd, a researcher working at IBM, had conceived relational databases a few years earlier. Also in those years both the idea of data normalization (who does not remember the third normal form learned at university?) And the ACID properties of transactions (we talked about it here) were born.
It was the golden age of databases but at the same time pioneering!
But let's go back to our Codd. His ideas materialized in a software called System R which is the first relational database prototype equipped with a first implementation of the SQL language.
It was 1974.
Codd had developed the system R at IBM's San Jose Research Laboratory and in those years many IBM engineers wanted to follow and support him and among these are the well-known Ralph Gomory, Jim Gray and Don Chamberlain.
The new SQL language was structured like a database query. Using the appropriate syntax, it is sufficient to specify which data to extract and the database engine should have independently found the best way to extract the data.
A lot of research was done to study query optimization.
And here we are approaching our Halloween problem!
Two of Chamberlein's colleagues, Selinger and Astrahan were studying the behavior of the optimizer in presence of indexes, when they faced this specific case:
Give a ten percent raise to everybody who earns less than 25.000 dollars.
On the employee table there was an index on the salary field and the query was this:
UPDATE e
SET e.SALARY = e.SALARY * 1.1
FROM EMPLOYEE e
WHERE
e.SALARY < 25000
Surprisingly, the result was that all employees were earning over $ 25,000!
The two researchers had found a problem and went to Chamberlein's office to report it to him.
But it was Friday and so Chamberlein said, "We won't be able to solve this problem today, let's give it a name and we'll look at it on Monday".
Since it was Halloween, this problem got that name!
Now that you know the origin of the curious name of this problem, I'll tell you why the query is working incorrectly!
What should the optimizer do to resolve the query?
Well, first it must surely find all the employees who earn less than $ 25,000 and the best way to get data is use the index on the SALARY field. Data are sorted by SALARY.
We start processing the first employee who is the one with the lowest salary. If he earns less than $ 25,000, we'll give him the raise.
If we assume that he first earned $ 10,000, then he will earn $ 11,000 later, but this will however cause a jump in the index position: to be precise, it would go on in order of salary.
However, this means that continuing with the subsequent employees will happen to meet again this same employee to whom we will improperly give another salary increase!
This is the Halloween problem explained in the simplest way I can.
How to remedy this problem?
The first solution that was thought of was trivial .. avoid using the index on the SALARY field.
Later it was also thought of another alternative way. That is to use the index on the SALARY to read the data quickly but to have the data reading phase followed by another operator.
This operator would have copied the data into a support table (in tempdb) and the calculations would have been performed starting on this table: This is the TABLE SPOOL operator
SQL Server and the Table Spool
Now let's see if SQL Server behaves as the theory says ..
Let's take an example.
I create the Employees table, create the indexes and fill it with some data.
CREATE TABLE EMPLOYEE
(ID INT IDENTITY(1,1),
[NAME] VARCHAR(80),
SALARY FLOAT)
CREATE CLUSTERED INDEX IDX_EMPLOYEE_ID ON EMPLOYEE(ID)
CREATE INDEX IDX_EMPLOYEE_SALARY ON EMPLOYEE(SALARY)
INSERT INTO EMPLOYEE (NAME,SALARY) VALUES ('LUKE', 5000)
INSERT INTO EMPLOYEE (NAME,SALARY) VALUES ('LUKE', 7000)
INSERT INTO EMPLOYEE (NAME,SALARY) VALUES ('LUKE', 9000)
INSERT INTO EMPLOYEE (NAME,SALARY) VALUES ('LUKE', 11000)
INSERT INTO EMPLOYEE (NAME,SALARY) VALUES ('LUKE', 13000)
INSERT INTO EMPLOYEE (NAME,SALARY) VALUES ('LUKE', 15000)
INSERT INTO EMPLOYEE (NAME,SALARY) VALUES ('LUKE', 17000)
Now execute the "historical" update:
UPDATE e
SET e.SALARY = e.SALARY * 1.1
FROM EMPLOYEE e
WHERE
e.SALARY < 25000
Looking at the execution plan we can see that the index is not used!
And if i try to force SQL Server to use the index on the SALARY field?
What will happen?
UPDATE e
SET e.SALARY = e.SALARY * 1.1
FROM EMPLOYEE e WITH (INDEX (IDX_EMPLOYEE_SALARY))
WHERE
e.SALARY < 25000
Well..SQL Server will add the Table Spool operator!
P.S. this post may be related to this SQL Server, Avoid that damn Table Spool! which I recommend reading to anyone involved in tuning.
That'all for today!!
Luca
Previous post: SQL Server, how to save images and attachments in the database. Is better to store images into the database or into the filesystem?
Maybe the best explanation I have seen on this
ReplyDeleteThe easiest explanation of Halloween problem and Table Spool I have seen ever.
ReplyDelete