Databases and surroundings: Wikidata, SPARQL & Scarlett Johansson

Hi Guys,

Welcome to this slightly off-topic post.
I must admit I am a huge fan of Wikipedia.
Wikipedia is a free multilingual online encyclopedia born with the intent to collect all human knowledge.
But not only is it free and multilingual, it is also collaborative.
Anyone can contribute, provided they follow certain criteria.

If Wikipedia aims to collect all human knowledge, there is also a parallel wiki that aims to catalog it.
We are talking about Wikidata.

Like Wikipedia, Wikidata is also a free and open knowledge base that can be read and edited by humans and machines. 
 
In addition, Wikidata serves as a central repository for the structured data of Wikimedia projects, including Wikipedia but also Wikivoyage, Wiktionary, Wikisource and others. 

In practical terms, every Wikipedia page is linked to an item on Wikidata.
Wikidata is therefore responsible for cataloging all this information through a series of properties and classifiers that users can define.

Wikidata is in fact nothing but a large database and, like every database, it has a language to query it.
 

Wikidata

If we take, for example, the Wikipedia page dedicated to the planet Earth


we can see, as on every Wikipedia page, the "Edit links" link.

By clicking on this link we land on the Wikidata entry for that page.
 

 
Each entry (Q2 in this example) in the Wikidata database has a set of statements, or properties.
 
For example "Instance of" is a property (the property P31) 

A property may have one or more values. In the example, for the item "planet Earth", the property "Instance of" has 3 values: "terrestrial planet", "inner planet of the solar system" and "geographic region".
In this case the property accepts another Wikidata item as a value.
Other properties may accept, for example, a numeric value.
 
For example, the property "population" accepts an integer as a valid value.
 

 
In the image above, the data shows that the population of the Earth in the year 40 ECB was 7 million inhabitants (with a tolerance of one million).
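To make the idea of "a property with one or more values" concrete, here is a minimal Python sketch. It assumes a simplified, made-up dictionary layout (it is not the real Wikidata JSON format), and the helper name values_of is hypothetical.

```python
# Hypothetical, simplified model of a Wikidata item (NOT the real JSON format).
# Each property ID maps to a list, because a property may hold several values.
earth = {
    "id": "Q2",
    "statements": {
        # "Instance of" (P31): the values are other Wikidata items,
        # written here by label for readability.
        "P31": [
            "terrestrial planet",
            "inner planet of the solar system",
            "geographic region",
        ],
    },
}


def values_of(item, property_id):
    """Return every value stored for a property, or an empty list if none."""
    return item["statements"].get(property_id, [])


print(len(values_of(earth, "P31")))  # how many "Instance of" values we stored
print(values_of(earth, "P999999"))   # a property this sketch does not contain
```

Storing a list for every property, even when there is only one value, keeps the lookup code identical for single-valued and multi-valued properties.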

This whole system was created in order to organize and catalogue all kinds of information.
Sometimes, for example, a new property is proposed.
In this case it is discussed and then put to a vote; if the majority votes in favor, the property is created.

Clearly each item will have its own set of properties.
Just as the item "planet Earth" has the properties "population", "highest point" and so on, the item Scarlett Johansson, being a "human being", has others:
 

 
In this case the item Q34436, which represents the actress Scarlett Johansson, has the properties:
  • "Instance of" (P31) with value "human" (Q5)
  • "Image" (P18), which contains an image as its value
  • "Sex or gender" (P21) with value "female" (Q6581072)
Many other properties follow...
 
Now, what if we wanted to know, for example, all the actresses born in 1987 who appeared in a certain film? Or the list of planets in the solar system?

Now that we have all this data, how do we query it?


How to query Wikidata: SPARQL

We can access this huge amount of data using a language called SPARQL.
 
You enter your SPARQL query on the Wikidata Query Service page at this address: https://query.wikidata.org/
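If you prefer to call the service from a script rather than from the web page, the sketch below shows how a request URL could be built in Python. It assumes the service's SPARQL endpoint at https://query.wikidata.org/sparql with the parameters query and format=json, none of which are covered in this post; the function name build_query_url is hypothetical. The code only builds the URL and does not send any request.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumption: the SPARQL endpoint behind the query service web page.
ENDPOINT = "https://query.wikidata.org/sparql"


def build_query_url(sparql, endpoint=ENDPOINT):
    """Return a GET URL carrying the query and asking for JSON results."""
    params = {"query": sparql, "format": "json"}
    return endpoint + "?" + urlencode(params)


# A tiny query: five items that are an instance of (P31) human (Q5).
query = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 5"
url = build_query_url(query)
print(url)

# Decoding the URL again gives back the original query text.
print(parse_qs(urlparse(url).query)["query"][0])
```

urlencode takes care of escaping the spaces, braces and question marks that a SPARQL query contains, which is easy to get wrong by hand.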


This is an example of a SPARQL query that returns the list of living actresses from the United States:

# Living American actresses
SELECT ?item ?itemLabel ?itemDescription ?height (SAMPLE(?img) AS ?image) (SAMPLE(?birth) AS ?dob) ?sl
WHERE {
  ?item wdt:P106 wd:Q33999 ;     # occupation: actor
        wdt:P27 wd:Q30 ;         # country of citizenship: United States of America
        wdt:P21 wd:Q6581072 .    # sex or gender: female
  MINUS { ?item wdt:P570 [] }    # exclude items that have a date of death
  OPTIONAL { ?item wdt:P2048 ?height }
  OPTIONAL { ?item wdt:P18 ?img }
  OPTIONAL { ?item wdt:P569 ?birth }   # date of birth, sampled as ?dob in the SELECT
  OPTIONAL { ?item wikibase:sitelinks ?sl }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
}
GROUP BY ?item ?itemLabel ?itemDescription ?height ?sl
ORDER BY DESC(?sl)

        
If you insert this query and then press the blue arrow, you will obtain this result:

This is not a SPARQL course at all, so let’s just outline the query we just used.

If you know the SQL language you will recognize some common elements.

The SELECT, WHERE, GROUP BY and ORDER BY keywords also exist in SPARQL.

 


Then the logic becomes a little different.

"?item" is a variable that can stand for any item in Wikidata, so you need to apply some filters to extract the data you want.

You filter by adding conditions on the item's properties.


The filter in our query applies 3 conditions.

The first condition is:

    
  ?item wdt:P106 wd:Q33999;
        wdt:P27 wd:Q30 ;
        wdt:P21 wd:Q6581072 .
     

In this case we are asking to extract only those items where the property P106 is equal to the item Q33999.

Property P106 means "occupation" (of a person) and Q33999 is the entry "actor".

 

Similarly, the second condition means that "country of citizenship" (P27) must be "United States of America" (Q30):

    
  ?item wdt:P106 wd:Q33999;
        wdt:P27 wd:Q30 ;
        wdt:P21 wd:Q6581072 .
     
Finally, the third condition means that "sex or gender" (P21) must be "female" (Q6581072):
    
  ?item wdt:P106 wd:Q33999;
        wdt:P27 wd:Q30 ;
        wdt:P21 wd:Q6581072 .
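
The three conditions share the same subject ("?item"), which is why they are chained with ";" and closed with ".". As a rough illustration, this Python sketch generates such a block from a list of property/value pairs. The helper name build_pattern is hypothetical, and the sketch only handles the simple "property equals item" case used in this post.

```python
def build_pattern(subject, conditions):
    """Build a SPARQL block where every condition shares the same subject.

    conditions is a list of (property_id, item_id) pairs, for example
    ("P27", "Q30") for "country of citizenship is United States of America".
    """
    lines = []
    for index, (prop, value) in enumerate(conditions):
        # Only the first line names the subject; the rest are indented under it.
        prefix = subject if index == 0 else " " * len(subject)
        # ";" means "same subject, next condition"; "." ends the block.
        terminator = " ." if index == len(conditions) - 1 else " ;"
        lines.append(f"{prefix} wdt:{prop} wd:{value}{terminator}")
    return "\n".join(lines)


conditions = [
    ("P106", "Q33999"),   # occupation: actor
    ("P27", "Q30"),       # country of citizenship: United States of America
    ("P21", "Q6581072"),  # sex or gender: female
]
pattern = build_pattern("?item", conditions)
print(pattern)
```

Adding a fourth pair to the list is enough to narrow the search further, since every condition in the block must hold for an item to be returned.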
     

That’s really nice, isn’t it?

Another basic thing to know is how to extract and display information.

For example, we have the list of actresses and we want to see their date of birth or height.

In the example, the SELECT includes the height.
The height value is held in the variable "?height", and to extract it we used the OPTIONAL keyword.

OPTIONAL lets you request a property, in this case P2048 (height), without excluding the items that do not have it.
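
A practical consequence of OPTIONAL is that a result row may simply lack the variable. The Python sketch below reads a response in the standard SPARQL JSON results layout and copes with a missing "height". The sample response is invented for illustration (the item identifiers and the height are placeholders, not real Wikidata data), and the function name extract_heights is hypothetical.

```python
import json

# Invented sample in the shape of a SPARQL JSON result. The identifiers and
# the height are placeholders for illustration, not real Wikidata data.
sample_response = """
{
  "head": {"vars": ["item", "height"]},
  "results": {"bindings": [
    {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_EXAMPLE_1"},
     "height": {"type": "literal", "value": "1.70"}},
    {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_EXAMPLE_2"}}
  ]}
}
"""


def extract_heights(response_text):
    """Return (item, height) pairs; height is None when the row has none."""
    data = json.loads(response_text)
    rows = []
    for binding in data["results"]["bindings"]:
        item = binding["item"]["value"]
        # An OPTIONAL variable with no match is absent from the binding.
        height = binding.get("height")
        rows.append((item, float(height["value"]) if height else None))
    return rows


for item, height in extract_heights(sample_response):
    print(item, height)
```

Without OPTIONAL, the second row would not appear at all, because the height condition would be mandatory.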

Of course, as mentioned, this is just an introduction. SPARQL syntax can do much more and become much more complex.

But this weekend you can practice searching Wikidata directly for the information that interests you most.

I hope you enjoyed this post!

That's all for today,
~Luke




















