Disassembling the SQL Server parser. The LGetToken procedure - part 1
Happy Sunday my dear friends and welcome back to this blog!
Today we will talk about the parser component of SQL Server.
We keep disassembling the parser, as we have already done here:
- SQL Server, inside the parser: the Get_Gen_Lex procedure
- SQL Server, A bit of reverse engineering inside the parser: the Parser and the GetChar procedure. (note: it contains information published for the first time)
I don't think this has ever been done before, and I don't even know where I will end up... maybe I could write a book about it, who knows!
...but today I want to show you a few interesting things.
So start Management Studio, enter a simple SELECT command, start a debugger like WinDbg
...and of course "happy reading"!
The LGetToken procedure
The LGetToken procedure is defined as follows:
void LGetToken(uint **param_1,uint *param_2,ushort **param_3)
The param_2 parameter contains the pointer to the SQL command.
This procedure is called by the yylex procedure in order to parse the T-SQL commands you send to the database engine of SQL Server.
Inside this procedure the statement you entered is split into tokens ("Select", "From", "Customer") and each token is validated.
The tokens are decoded with the help of the GetChar and Get_Gen_Lex functions, which we talked about in the past posts.
Taking a look at the code, we can find the part where the command string is parsed and all the non-printable characters are skipped.
Take a look at Get_Gen_Lex.
Now you know that you can add non-printable characters before your statement: they will simply be skipped!
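To make the idea concrete, here is a minimal sketch in C of a loop that skips leading non-printable characters. It is not the real SQL Server code: the name skip_non_printable is hypothetical, and I am assuming that the command is a 16-bit character string and that "non-printable" means every code unit up to and including the space (0x20).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT the real SQL Server code.
   Assumption: every 16-bit code unit <= 0x20 counts as "non-printable".
   Returns a pointer to the first character that is not skipped
   (or to the terminating 0 if the whole string is skipped). */
static const unsigned short *skip_non_printable(const unsigned short *p)
{
    assert(p != NULL);
    while (*p != 0 && *p <= 0x20)
        p++;
    return p;
}
```

With a buffer that starts with a tab, a 0x01 and a space before "SELECT", the returned pointer lands on the 'S'.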
Next, the code looks for the character N (0x4e) followed by the character ' (0x27).
The N' prefix marks a Unicode string literal.
For now we don't care whether a "$" (0x24) is found.
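The N' check can be sketched in a few lines of C. Again this is only an illustration under my own assumptions: has_unicode_prefix is a hypothetical name, and the real procedure works on its own internal state rather than on a plain pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT the real SQL Server code.
   Returns 1 when the text at p starts with N (0x4e) followed by ' (0x27). */
static int has_unicode_prefix(const unsigned short *p)
{
    assert(p != NULL);
    /* If p[0] is the terminating 0, the && stops before p[1] is read. */
    return p[0] == 0x4E && p[1] == 0x27;
}
```

A string that starts with N'abc' gives 1, while a plain identifier such as Name gives 0 because the second character is not a quote.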
Another step ahead!
...And now we are reading exactly the 'S' character.
We have a SWITCH statement.
Get_Gen_Lex returns the value "1", which stands for all the alphabetical characters from 'a' to 'z' plus the characters "_", "@" and "#".
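Here is a rough sketch of what such a classification plus a switch could look like in C. The names sketch_char_class and starts_identifier are hypothetical, the class value 0 is just my placeholder for "everything else", and I am assuming that uppercase letters fall in class 1 too, since the 'S' we are reading gets the value 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the classification done by Get_Gen_Lex.
   Class 1: letters plus '_', '@' and '#'. Class 0: placeholder for the rest. */
static int sketch_char_class(unsigned short c)
{
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
        return 1;
    if (c == '_' || c == '@' || c == '#')
        return 1;
    return 0;
}

/* Hypothetical dispatch: switch on the class of the current character. */
static int starts_identifier(const unsigned short *p)
{
    assert(p != NULL);
    switch (sketch_char_class(*p)) {
    case 1:
        return 1;   /* the character can start a keyword or an identifier */
    default:
        return 0;   /* other classes are not covered by this sketch */
    }
}
```

In this sketch 'S', '@' and '#' all end up in the same branch, while a digit or a quote takes the default one.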
Let's see now what happens...
We are parsing the input stream char by char (remember that the current character is in params[164]).
We continue reading char by char until we read a line feed (0x0A) character.
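The reading loop can be pictured with this small C sketch. It only illustrates the "char by char until a line feed" idea: scan_until_line_feed is a hypothetical name, and the real procedure keeps its position inside the params structure instead of returning a count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT the real SQL Server code.
   Counts the 16-bit characters before the first line feed (0x0A)
   or before the end of the string, whichever comes first. */
static size_t scan_until_line_feed(const unsigned short *p)
{
    size_t n = 0;
    assert(p != NULL);
    while (p[n] != 0 && p[n] != 0x0A)
        n++;
    return n;
}
```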
OK, we have almost identified a token.
TO BE CONTINUED!
Thanks to all! I wish you a great great week!
Please subscribe to this blog if you found it helpful!
But, even more important, remember to think big and always share the knowledge!