SQL Server, Math and Applications


Hi Guys!

It's been a while since I wanted to write an article like this.
I had it in my head and it reminded me of one of those articles I read in computer magazines many years ago.
Many years ago computer science, at least for me, always had a background of mystery and excitement due to the discovery of something new.
Look what you can do with the computer! people said!

In the last article we saw how in SQL Server, using a float data type, when we store a number very often we store a slightly different one. This is the reason why rounding problems arise during certain operations.
You can read it here:SQL Server, DECIMAL and FLOAT data types. Are you ready to deep dive into the secrets of SQL Server?

Today we totally change the subject .. 

Let's see some mathematical concepts and then we will try to put into practice through SQL Server in order to get a new functionality that will be very useful!



Carissimi lettori,

E' un po che volevo scrivere un articolo come questo.
L'avevo in testa e mi ricordava uno di quegli articoli che leggevo sulle riviste di informatica tanti anni fa.
Allora l'informatica, almeno per me, aveva sempre un fondo di mistero ed una eccitazione dovuta alla scoperta di qualcosa di nuovo.
Guarda cosa si può fare con il computer, si diceva!

Nello scorso articolo abbiamo visto come in SQL Server, utilizzando un tipo di dato float, quando memorizziamo un numero molto spesso ne memorizziamo uno leggermente diverso. Questo è la causa per cui durante certe operazioni sorgono problemi di arrotondamento.
Lo potete leggere qui: SQL Server, DECIMAL and FLOAT data types. Are you ready to deep dive into the secrets of SQL Server?

Ma oggi cambiamo totalmente discorso..

Vediamo alcuni concetti matematici (eh, un po di teoria vi tocca oggi!) che poi cercheremo di mettere in pratica attraverso SQL Server al fine di ottenere una nuova funzionalità che vedrete essere molto utile.



Introduction


You know when, for example, you have a customer database and you need to look for everyone whose name is LUCA?
Simple you say, just use the LIKE operator ...

SELECT * FROM ANAGRAFICA WHERE NOME LIKE '%LUCA%'
       
Yep, it's true!
But then we realize that many times the registries also contain many typos.
How many times have you found LICA, LUICA ... and all the possible combinations instead of LUCA?

So how do we do? 
The question is: how can we intercept all possible variants of the string we are looking for?

Having this function in SQL Server would be very useful, but it does not exist as standard.
We then ask for help with math!




Avete presente quando ad esempio avete un anagrafica di clienti e dovete cercare tutti quelli il cui nome è LUCA?
Semplice direte, basta usare l'operatore LIKE...

SELECT * FROM ANAGRAFICA WHERE NOME LIKE '%LUCA%'
      
Già, è vero!
Poi però ci accorgiamo che tante volte le anagrafiche contengono anche tanti refusi.
Quante volte al posto di LUCA avete trovato LICA, LUICA ...e tutte le possibili combinazioni?

Quindi come facciamo? 
La domanda è: come possiamo  intercettare tutte le possibile varianti della stringa che cerchiamo?

Avere questa funzione in SQL Server sarebbe utilissimo ma, standard, non esiste.
Chiediamo quindi aiuto alla matematica!




The Levenshtein's metric


   In information theory the Levenshtein metric or distance is a measure of the difference between two strings.
Created by Russian scientist Vladimin Levenshtein in the 1960s, it determines how two strings of characters A and B are similar based on the minimum number of elementary modifications that allow A to be transformed into B.

Where by elementary modification we mean: 
  • deleting a character, 
  • replacing one character with another 
  • entering a character

Said so it seems simple, but if we tried to make it through the T-SQL language?

The trick (if we can define it this way) is to imagine a matrix of dimensions m x n where m is the size of the string A and n the size of the string B.

Let's imagine then to put our two strings to compare as x-axes and ordinates of a matrix D:


Now, we indicate with the notation D (e, f) the "edit distance" between a string of length "e" and a string of length "f"

The proposed algorithm becomes the following:

Step 1

We set the initial states in the matrix D on the first row and on the first column:

1) The distance between a string of length "e" and a null string requires "and" deletions: D (e, 0) = e
2) The distance between a null string and a string of length f requires "f" insertions: D (0, f) = f


Step 2


 
For each cell of our matrix we carry out the above calculation.

In practice,

Let's try with e = 1 and f = 1 we will have:


D(1,1) = MIN(   
D(1-1,1) + 1 = D(0,1) +1 = 2         (elemento più a sinistra)
                               D(1,1-1) = D(1,0) + 1 = 2               (elemento più in alto)
                               D(0,0) + 1 sub                                   (elemento a SX e in alto)
                               D(0,0) + 0 match
) = 0

Let's try with e = 1 and f = 2 we will have:

D(1,2) = MIN(
                               D(0,2)+1 = 2 +1 = 3
                               D(1,1)+1 = 0 +1 = 1
                               D(0,1) + 1 = 1 + 1 = 2
                               D(0,1) + 0 = 1
                ) = 1

Let's try with e = 1 and f =3 we will have: 

D(1,3) = MIN(
                               D(0,3)+1 = 3 +1 = 4
                               D(1,2)+1 = 1 +1 = 2
                               D(0,2) + 1 = 2 + 1 = 3
                               D(0,2) + 0 = 2
                ) = 1

And so on up to the last cell where the searched value will remain.

 

Nella teoria dell'informazione la metrica o distanza di Levenshtein è una misura della differenza tra due stringhe.
Creata dallo scienziato russo Vladimin Levenshtein negli anni sessanta determina quanto due stringhe di caratteri A e B sono simili in base al numero minimo di modifiche elementari che consentono di trasformare la A nella B.

Dove per modifica elementare si intende:
  • la cancellazione di un carattere,
  • la sostituzione di un carattere con un altro
  • l'inserimento di un carattere 
Detta così sembra semplice, ma se provassimo a realizzarla tramite il linguaggio T-SQL?

Il trucco (se così lo possiamo definire) è quello di immaginarci una matrice di dimensioni m x n dove m è la dimensione della stringa A ed n la dimensione della stringa B.

Immaginiamo quindi di mettere le nostre due stringhe da confrontare come ascisse ed ordinate di una matrice D:



Ora, indichiamo con la notazione D(e,f) la “distanza di edit” tra una stringa di lunghezza “e” ed una stringa di lunghezza “f“

L’algoritmo proposto diventa il seguente:

Passo 1

Impostiamo nella matrice D sulla prima riga e sulla prima colonna gli stati iniziali: 

1) La distanza tra una stringa di lunghezza “e” ed una stringa nulla richiede “e” cancellazioni: D(e,0) = e
2) La distanza tra una stringa nulla ed una stringa di lunghezza f richiede “f” inserimenti:  D(0,f) = f




Passo 2


 


Per ogni cella della nostra matrice effettuiamo il calcolo sopra riportato.

Nella pratica,

Proviamo con il e=1 ed f= 1 avremo:

D(1,1) = MIN(   
D(1-1,1) + 1 = D(0,1) +1 = 2         (elemento più a sinistra)
                               D(1,1-1) = D(1,0) + 1 = 2               (elemento più in alto)
                               D(0,0) + 1 sub                                   (elemento a SX e in alto)
                               D(0,0) + 0 match
) = 0

Proviamo con il e=1 ed f= 2 avremo:

D(1,2) = MIN(
                               D(0,2)+1 = 2 +1 = 3
                               D(1,1)+1 = 0 +1 = 1
                               D(0,1) + 1 = 1 + 1 = 2
                               D(0,1) + 0 = 1
                ) = 1

Proviamo con il e=1 ed f= 3 avremo:

D(1,3) = MIN(
                               D(0,3)+1 = 3 +1 = 4
                               D(1,2)+1 = 1 +1 = 2
                               D(0,2) + 1 = 2 + 1 = 3
                               D(0,2) + 0 = 2
                ) = 1


E così via fino all’ultima cella in cui rimarrà il valore cercato.


SQL and the Levenshtein's metric

  First create the skeleton of the function.
We will need two input parameters which are respectively our two strings to compare.

The function will return an integer.
       
CREATE function LEVENSHTEIN( 
  @SourceString nvarchar(100),  
  @TargetString nvarchar(100)
)
returns int
as
BEGIN
END 


Now, the typical implementation of this algorithm uses a matrix.
However, since this type of data is not present in T-SQL, let's see how to implement it using an array of characters.

We then define our character array:

@Matrix Nvarchar(4000)

With a bit of maths we have:

D[X,Y] = Substring(@Matrix,Y*(K+1)+X-1+1,1))

Dove K = LEN(@SourceString)



Well, now let's write step 1 to initialize the values ​​on the first row and on the first column:

CREATE function LEVENSHTEIN( 
  @SourceString nvarchar(100),  
  @TargetString nvarchar(100)
)
returns int
as
BEGIN

  DECLARE @Matrix Nvarchar(4000),  
  @LD int, @TargetStringLength int, @SourceStringLength int,
  @ii int, @jj int, @CurrentSourceChar nchar(1), @CurrentTargetChar nchar(1),@Cost int,
  @Above int,@AboveAndToLeft int,@ToTheLeft int, @MinimumValueOfCells int

  if @SourceString is null or @TargetString is null return null
  Select @SourceStringLength=LEN(@SourceString),
       @TargetStringLength=LEN(@TargetString),
       @Matrix=replicate(nchar(0),(@SourceStringLength+1)*(@TargetStringLength+1))
  If @SourceStringLength = 0 return @TargetStringLength
  If @TargetStringLength = 0 return @SourceStringLength
  if (@TargetStringLength+1)*(@SourceStringLength+1)> 4000 return -1
  --  Initialize the first row to 0..n.

  SET @ii=0
  WHILE @ii<=@SourceStringLength
  BEGIN
     SET @Matrix=STUFF(@Matrix,@ii+1,1,nchar(@ii))--d(i, 0) = i
     SET @ii=@ii+1
  END

  -- Initialize the first column to 0..m.

  SET @ii=0
  WHILE @ii<=@TargetStringLength
  BEGIN
     SET @Matrix=STUFF(@Matrix,@ii*(@SourceStringLength+1)+1,1,nchar(@ii))--d(0, j) = j
     SET @ii=@ii+1
  END
END 


Then add to the function the step 2:

CREATE function LEVENSHTEIN( 
  @SourceString nvarchar(100),  
  @TargetString nvarchar(100)
)
returns int
as
BEGIN

  DECLARE @Matrix Nvarchar(4000), @LD int, @TargetStringLength int, @SourceStringLength int,
  @ii int, @jj int, @CurrentSourceChar nchar(1), @CurrentTargetChar nchar(1),@Cost int,
  @Above int,@AboveAndToLeft int,@ToTheLeft int, @MinimumValueOfCells int
  -- Step 1: Set n to be the length of s. Set m to be the length of t.
  -- If n = 0, return m and exit.
  -- If m = 0, return n and exit.
  -- Construct a matrix containing 0..m rows and 0..n columns.
  if @SourceString is null or @TargetString is null return null
  Select @SourceStringLength=LEN(@SourceString),
  @TargetStringLength=LEN(@TargetString),
  @Matrix=replicate(nchar(0),(@SourceStringLength+1)*(@TargetStringLength+1))
  If @SourceStringLength = 0 return @TargetStringLength
  If @TargetStringLength = 0 return @SourceStringLength
  if (@TargetStringLength+1)*(@SourceStringLength+1)> 4000 return -1


  --  Initialize the first row to 0..n.

  SET @ii=0
  WHILE @ii<=@SourceStringLength
  BEGIN
     SET @Matrix=STUFF(@Matrix,@ii+1,1,nchar(@ii))--d(i, 0) = i
     SET @ii=@ii+1
  END

  -- Initialize the first column to 0..m.

  SET @ii=0
  WHILE @ii<=@TargetStringLength
  BEGIN
     SET @Matrix=STUFF(@Matrix,@ii*(@SourceStringLength+1)+1,1,nchar(@ii))--d(0, j) = j
     SET @ii=@ii+1
  END
  --Passo 2: 
  -- Examine each character of t (j from 1 to m).
 

  SET @ii=1
  WHILE @ii<=@SourceStringLength
  BEGIN
    -- Examine each character of t (j from 1 to m).
    SET @jj=1
    WHILE @jj<=@TargetStringLength
    BEGIN
        -- For each row
        Select
 
        --Set cell d[i,j] of the matrix equal to the minimum of:
        --a. The cell immediately above plus 1: d[i-1,j] + 1.
        --b. The cell immediately to the left plus 1: d[i,j-1] + 1.
        --c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost
        @Above=unicode(substring(@Matrix,@jj*(@SourceStringLength+1)+@ii-1+1,1))+1,
        @ToTheLeft=unicode(substring(@Matrix,(@jj-1)*(@SourceStringLength+1)+@ii+1,1))+1,
        @AboveAndToLeft=unicode(substring(@Matrix,(@jj-1)*(@SourceStringLength+1)+@ii-1+1,1))
         + case when (substring(@SourceString,@ii,1)) = (substring(@TargetString,@jj,1))
            then 0 else 1 end 
        
        --the cost
        -- If s[i] equals t[j], the cost is 0.
        -- If s[i] doesn't equal t[j], the cost is 1.
        -- now calculate the minimum value of the three
        if (@Above < @ToTheLeft) AND (@Above < @AboveAndToLeft)
            select @MinimumValueOfCells=@Above
        else if (@ToTheLeft < @Above) AND (@ToTheLeft < @AboveAndToLeft)
            select @MinimumValueOfCells=@ToTheLeft
        else
            select @MinimumValueOfCells=@AboveAndToLeft
        Select @Matrix=STUFF(@Matrix,
                   @jj*(@SourceStringLength+1)+@ii+1,1,
                   nchar(@MinimumValueOfCells)),
           @jj=@jj+1
    END
    SET @ii=@ii+1
  END    
  --After iteration steps are complete, distance is found in cell d[n,m]
  return unicode(substring(
     @Matrix,@SourceStringLength*(@TargetStringLength+1)+@TargetStringLength+1,1
     ))
END
       

Our function is finished and we can try it!
Let's write for example:

SELECT dbo.LEVENSHTEIN('romani','ramona')

We obtain the value 3!
 

 

Per realizzare la nostra funzione creaimone per prima cosa lo scheletro.
Ci serviranno due parametri in ingresso che sono rispettivamente le nostre due stringhe da confrontare.

La funzione ritornerà un intero.
       
CREATE function LEVENSHTEIN( 
  @SourceString nvarchar(100),  
  @TargetString nvarchar(100)
)
returns int
as
BEGIN
END 
 
Ora, l’implementazione tipica di questo algoritmo utilizza una matrice.
Non essendo però presente in T-SQL questo tipo di dato vediamo come implementarlo tramite un array di caratteri.

Definiamo quindi il nostro array di caratteri:

@Matrix Nvarchar(4000)


E con un po di matematica abbiamo che:

D[X,Y] = Substring(@Matrix,Y*(K+1)+X-1+1,1))

Dove K = LEN(@SourceString)



Bene, ora scriviamo il passo 1 per inizializzare i valori sulla prima riga e sulla prima colonna:

CREATE function LEVENSHTEIN( 
  @SourceString nvarchar(100),  
  @TargetString nvarchar(100)
)
returns int
as
BEGIN

  DECLARE @Matrix Nvarchar(4000),  
  @LD int, @TargetStringLength int, @SourceStringLength int,
  @ii int, @jj int, @CurrentSourceChar nchar(1), @CurrentTargetChar nchar(1),@Cost int,
  @Above int,@AboveAndToLeft int,@ToTheLeft int, @MinimumValueOfCells int

  if @SourceString is null or @TargetString is null return null
  Select @SourceStringLength=LEN(@SourceString),
       @TargetStringLength=LEN(@TargetString),
       @Matrix=replicate(nchar(0),(@SourceStringLength+1)*(@TargetStringLength+1))
  If @SourceStringLength = 0 return @TargetStringLength
  If @TargetStringLength = 0 return @SourceStringLength
  if (@TargetStringLength+1)*(@SourceStringLength+1)> 4000 return -1
  --  Initialize the first row to 0..n.

  SET @ii=0
  WHILE @ii<=@SourceStringLength
  BEGIN
     SET @Matrix=STUFF(@Matrix,@ii+1,1,nchar(@ii))--d(i, 0) = i
     SET @ii=@ii+1
  END

  -- Initialize the first column to 0..m.

  SET @ii=0
  WHILE @ii<=@TargetStringLength
  BEGIN
     SET @Matrix=STUFF(@Matrix,@ii*(@SourceStringLength+1)+1,1,nchar(@ii))--d(0, j) = j
     SET @ii=@ii+1
  END
END 


Adesso aggiungiamo alla funzione il passo 2:


CREATE function LEVENSHTEIN( 
  @SourceString nvarchar(100),  
  @TargetString nvarchar(100)
)
returns int
as
BEGIN

  DECLARE @Matrix Nvarchar(4000), @LD int, @TargetStringLength int, @SourceStringLength int,
  @ii int, @jj int, @CurrentSourceChar nchar(1), @CurrentTargetChar nchar(1),@Cost int,
  @Above int,@AboveAndToLeft int,@ToTheLeft int, @MinimumValueOfCells int
  -- Step 1: Set n to be the length of s. Set m to be the length of t.
  -- If n = 0, return m and exit.
  -- If m = 0, return n and exit.
  -- Construct a matrix containing 0..m rows and 0..n columns.
  if @SourceString is null or @TargetString is null return null
  Select @SourceStringLength=LEN(@SourceString),
  @TargetStringLength=LEN(@TargetString),
  @Matrix=replicate(nchar(0),(@SourceStringLength+1)*(@TargetStringLength+1))
  If @SourceStringLength = 0 return @TargetStringLength
  If @TargetStringLength = 0 return @SourceStringLength
  if (@TargetStringLength+1)*(@SourceStringLength+1)> 4000 return -1


  --  Initialize the first row to 0..n.

  SET @ii=0
  WHILE @ii<=@SourceStringLength
  BEGIN
     SET @Matrix=STUFF(@Matrix,@ii+1,1,nchar(@ii))--d(i, 0) = i
     SET @ii=@ii+1
  END

  -- Initialize the first column to 0..m.

  SET @ii=0
  WHILE @ii<=@TargetStringLength
  BEGIN
     SET @Matrix=STUFF(@Matrix,@ii*(@SourceStringLength+1)+1,1,nchar(@ii))--d(0, j) = j
     SET @ii=@ii+1
  END
  --Passo 2: 
  -- Examine each character of t (j from 1 to m).
 

  SET @ii=1
  WHILE @ii<=@SourceStringLength
  BEGIN
    -- Examine each character of t (j from 1 to m).
    SET @jj=1
    WHILE @jj<=@TargetStringLength
    BEGIN
        -- For each row
        Select
 
        --Set cell d[i,j] of the matrix equal to the minimum of:
        --a. The cell immediately above plus 1: d[i-1,j] + 1.
        --b. The cell immediately to the left plus 1: d[i,j-1] + 1.
        --c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost
        @Above=unicode(substring(@Matrix,@jj*(@SourceStringLength+1)+@ii-1+1,1))+1,
        @ToTheLeft=unicode(substring(@Matrix,(@jj-1)*(@SourceStringLength+1)+@ii+1,1))+1,
        @AboveAndToLeft=unicode(substring(@Matrix,(@jj-1)*(@SourceStringLength+1)+@ii-1+1,1))
         + case when (substring(@SourceString,@ii,1)) = (substring(@TargetString,@jj,1))
            then 0 else 1 end 
        
        --the cost
        -- If s[i] equals t[j], the cost is 0.
        -- If s[i] doesn't equal t[j], the cost is 1.
        -- now calculate the minimum value of the three
        if (@Above < @ToTheLeft) AND (@Above < @AboveAndToLeft)
            select @MinimumValueOfCells=@Above
        else if (@ToTheLeft < @Above) AND (@ToTheLeft < @AboveAndToLeft)
            select @MinimumValueOfCells=@ToTheLeft
        else
            select @MinimumValueOfCells=@AboveAndToLeft
        Select @Matrix=STUFF(@Matrix,
                   @jj*(@SourceStringLength+1)+@ii+1,1,
                   nchar(@MinimumValueOfCells)),
           @jj=@jj+1
    END
    SET @ii=@ii+1
  END    
  --After iteration steps are complete, distance is found in cell d[n,m]
  return unicode(substring(
     @Matrix,@SourceStringLength*(@TargetStringLength+1)+@TargetStringLength+1,1
     ))
END
       

La nostra funzione è terminata e la possiamo provare.
Scriviamo ad esempio:

SELECT dbo.LEVENSHTEIN('romani','ramona')

Ed otterremo il valore 3.



See you soon guys and remember to subscribe to this blog please!

Luca Biondi @ SQLServerPerformance blog!


Next post
:SQL Server, Il vostro "super cliente" è bloccato! (MSQL_DQ e PREEMPTIVE_COM_QUERYINTERFACE Wait states)
Previous post: SQL Server, DECIMAL and FLOAT data types. Are you ready to deep dive into the secrets of SQL Server? 
 

Comments

I Post più popolari

SQL Server, datetime vs. datetime2

SQL Server, execution plan and the lazy spool (clearly explained)

La clausola NOLOCK. Approfondiamo e facciamo chiarezza!