The Nearly Infinite Usefulness of SPARQL
We are now two-thirds of the way through our CWPK series. One reason we have emphasized ‘roundtripping’ in Cooking with Python and KBpedia is to accommodate the incorporation of information from external sources into KBpedia. From hierarchical relationships to annotations like definitions or labels, external sources can be essential. Of course, one can find flat files or spreadsheets or CSV files directly, but oftentimes we need specific information that can only come from querying the external source directly. Two of the sources we rely on most heavily — Wikidata and DBpedia — provide this access through SPARQL queries. We first introduced SPARQL in CWPK #25.
External SPARQL queries are the basis for obtaining instance data, values for instance attributes, missing fields like skos:altLabel and skos:definition, existing crosswalks or mappings, longer descriptions, subsumption relations, related links, and interesting joins and intersections across external knowledge base content. Often, one is able to specify the format (serialization) of the desired results.
The outputs from these external queries can be manipulated as strings, and then written to flat files useful for ingest into the various build routines. Of course, it is important that the format and CSV-nature of the results be maintained in a form that the build routines expect. One may alter the build formats or the extract formats, but to work they need to match on both ends.
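As a minimal sketch of that hand-off (the two bindings and the column layout are illustrative stand-ins, not the actual build format), a SPARQL JSON response can be flattened to a CSV flat file like so:

```python
import csv

# a SPARQL JSON response has this general shape; the two bindings
# below are hypothetical stand-ins for real query results
results = {'results': {'bindings': [
    {'item': {'value': 'http://www.wikidata.org/entity/Q537127'},
     'itemLabel': {'value': 'road bridge'}},
    {'item': {'value': 'http://www.wikidata.org/entity/Q681337'},
     'itemLabel': {'value': 'speed skating rink'}},
]}}

with open('wikidata_items.csv', 'w', newline='', encoding='utf8') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'prefLabel'])     # illustrative header row
    for row in results['results']['bindings']:
        writer.writerow([row['item']['value'], row['itemLabel']['value']])
```

The key point is that whatever columns the build routines expect must be reproduced here exactly.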
So, what we provide in today’s installment are some guidelines and recipes for using SPARQL to obtain the information you need and to write it to flat files. Because of their importance, we emphasize Wikidata and DBpedia (also a stand-in for Wikipedia) in our examples. Once populated, you may need to do some intermediate wrangling of these files to get them into shape for direct import. We covered that topic in brief in CWPK #36, but do not really address file wrangling further here. There are too many varieties to cover the topic in a meaningful way, though we certainly have examples in today’s installment and across the entire CWPK series that should provide a useful foundation for your own efforts.
Choosing Access Method
There are not that many public SPARQL endpoints available, and some are not always up and responsive. But the endpoints that do exist (identified in the Query Resources section at the conclusion of today’s installment) are often comprehensive and of high value. The two we will be emphasizing today, Wikidata and DBpedia (and, by extension, the linked open data (LOD) cloud beyond them), are among the most valuable. (Of course, many endpoints, like ones specific to a particular organization, are private, and can be parts of valuable, distributed information ecosystems.) Another notable endpoint worthy of your attention is the LOD endpoint maintained by OpenLink Software.
It is possible to query many of these sources directly online with an HTML interface, often also providing a choice of the output format desired. In some of the examples below, I provide a Try it! link that takes you directly to the source site and uses its native SPARQL interface. (Also, inspect the URI links for these Try it! options, since they show how SPARQL gets communicated over the Web.) You may often find this is the fastest and cleanest way to get useful results, and sometimes better formatted than what our home-brewed options below produce. Your mileage may vary. In any case, it is useful to learn how to issue SPARQL queries directly from within cowpoke. For that reason, I emphasize our home-brewed examples below.
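Those Try it! URIs also illustrate that a SPARQL request over the Web is just an HTTP GET with the query URL-encoded into a query parameter. As a sketch, we only construct such a URL here rather than sending it (and the format parameter is an assumption about the endpoint’s conventions; an Accept header works as well):

```python
from urllib.parse import urlencode

endpoint = 'https://query.wikidata.org/sparql'
query = 'SELECT ?s WHERE { ?s ?p ?o . } LIMIT 1'

# build the full request URL; fetching it (e.g., with urllib.request)
# would return the query results from the endpoint
url = endpoint + '?' + urlencode({'query': query, 'format': 'json'})
print(url)
```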
Setting Up This Installment
Like we have been emphasizing of late, we begin today’s installment with our standard start-up instructions:
from cowpoke.__main__ import *
from cowpoke.config import *
from owlready2 import *
from SPARQLWrapper import SPARQLWrapper, JSON
from rdflib import Graph
#sparql = SPARQLWrapper('http://dbpedia.org/sparql')
sparql = SPARQLWrapper('https://query.wikidata.org/sparql')
graph = world.as_rdflib_graph()
Of course, we also have a very capable query method for our own internal stores:
form_1 = list(graph.query_owlready("""
PREFIX rc: <http://kbpedia.org/kko/rc/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?x ?label
WHERE
{
?x rdfs:subClassOf rc:Eutheria.
?x skos:prefLabel ?label.
}
"""))
print(form_1)
Wikidata Queries
For the following Wikidata queries, run these assignments first:
from SPARQLWrapper import SPARQLWrapper, JSON
from rdflib import Graph
sparql = SPARQLWrapper('https://query.wikidata.org/sparql', agent='cowpoke 0.1 (github.com/Cognonto/cowpoke)')
We need to assign an ‘agent=’ because of limits Wikidata occasionally puts on queries. If you do many requests, you may want to consider adding your own agent definition.
One of the techniques I use most heavily is the VALUES statement. This construct allows a listing of IDs to be passed to the query source. Depending on various endpoint limits, you may be able to list 1000 or more IDs in such a listing; experience with a given endpoint will dictate. If you use the VALUES construct, just make sure you are using the proper format and prefix (wd: in this instance for a Q item within Wikidata) in front of each value.
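Because these listings can run to hundreds of IDs, it is handy to build the VALUES block from a Python list rather than by hand; a simple sketch:

```python
q_ids = ['Q25297630', 'Q537127', 'Q16831714']    # any listing of Q items

# prefix each ID with 'wd:' and embed the listing in a query template
values = ' '.join('wd:' + q for q in q_ids)
query = '''SELECT ?item ?itemLabel WHERE {
  VALUES ?item { %s }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}''' % values
print(query)
```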
Parent Class from Q IDs
The first query obtains the parent class from a submitted listing of Q items. You may also Try it! directly from Wikidata:
sparql.setQuery("""PREFIX schema: <http://schema.org/>
SELECT ?item ?itemLabel ?wikilink ?itemDescription ?subClass ?subClassLabel WHERE {
VALUES ?item { wd:Q25297630
wd:Q537127
wd:Q16831714
wd:Q24398318
wd:Q11755880
wd:Q681337
}
?item wdt:P910 ?subClass.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
#for result in results["results"]["bindings"]:
# print(result["item"]["value"])
print(results)
Notice that once we set our SPARQL endpoint and user agent, we are able to cut-and-paste different SPARQL queries between the opening and closing triple quotes ("""). The bracketing statements around them can be used repeatedly for different queries.
Go ahead and toggle between the print statements above to see how we can start varying outputs. Chances are you will need to do some string manipulation before your flat files are ready for ingest, but we can vary these specifications to get the initial output closer to our requirements.
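For example, item values come back as full entity URIs, so a typical first manipulation is to strip them down to bare Q IDs; the bindings below are a hypothetical stand-in for a real result set:

```python
# hypothetical stand-in for results['results']['bindings']
bindings = [
    {'item': {'value': 'http://www.wikidata.org/entity/Q537127'}},
    {'item': {'value': 'http://www.wikidata.org/entity/Q681337'}},
]

# keep only the last path segment of each entity URI
q_ids = [row['item']['value'].rsplit('/', 1)[-1] for row in bindings]
print(q_ids)          # ['Q537127', 'Q681337']
```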
subClass and Instance Listings for Q ID
This next query returns both the subclasses and the instances of a given Q item. You may Try it! as well.
sparql.setQuery("""SELECT ?subclass ?subclassLabel ?instance ?instanceLabel
WHERE
{
?subclass wdt:P279 wd:Q183366.
?instance wdt:P31 wd:Q183366.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY xsd:integer(SUBSTR(STR(?subclass),STRLEN("http://www.wikidata.org/entity/Q")+1))
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
#for result in results["results"]["bindings"]:
# print(result["item"]["value"])
print(results)
Useful Q Item Attributes
sparql.setQuery("""PREFIX schema: <http://schema.org/>
SELECT ?item ?itemLabel ?class ?classLabel ?description ?article ?itemAltLabel WHERE {
VALUES ?item { wd:Q1 wd:Q2 wd:Q3 wd:Q4 wd:Q5 }
?item wdt:P31 ?class;
wdt:P5008 ?project.
# ?article rdfs:comment ?description.
OPTIONAL {
?article schema:about ?item.
?article schema:isPartOf <https://en.wikipedia.org/>.
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results)
Get English Wikipedia Article Names from Q ID
sparql.setQuery("""SELECT DISTINCT ?lang ?item ?name WHERE {
VALUES ?item { wd:Q1
wd:Q2
wd:Q3
wd:Q4
wd:Q5
}
?article schema:about ?item; schema:inLanguage ?lang; schema:name ?name .
FILTER(?lang in ('en')) .
FILTER (!CONTAINS(?name, ':')) .
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results)
Listing of Q IDs from Property
sparql.setQuery("""SELECT
?item ?itemLabel
?value ?valueLabel
# valueLabel is only useful for properties with item-datatype
WHERE
{
?item wdt:P2167 ?value
# change P2167 to desired property
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results)
Missing Q Data from Wikidata
sparql.setQuery("""PREFIX schema: <http://schema.org/>
PREFIX w: <https://en.wikipedia.org/wiki/>
SELECT ?wikipedia ?item WHERE {
VALUES ?wikipedia { w:Tom_Hanks }
?wikipedia schema:about ?item .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results)
Q ID from Wikipedia ID
sparql.setQuery("""PREFIX schema: <http://schema.org/>
PREFIX w: <https://en.wikipedia.org/wiki/>
SELECT ?wikipedia ?item WHERE {
VALUES ?wikipedia { w:Euthanasia
w:Commercial_art_gallery
}
?wikipedia schema:about ?item .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results)
schema.org ↔ Wikidata Mapping
sparql.setQuery("""SELECT ?wd ?wdLabel ?type ?uri ?prefix ?localName WHERE {
{
{ ?wd wdt:P1628 ?uri . BIND("equivalent property" AS ?type) } UNION
{ ?wd wdt:P1709 ?uri . BIND("equivalent class" AS ?type) } UNION
{ ?wd wdt:P2888 ?uri . BIND("exact match" AS ?type) } UNION
{ ?wd wdt:P2235 ?uri . BIND("superproperty" AS ?type) } UNION
{ ?wd wdt:P2236 ?uri . BIND("subproperty" AS ?type) }
}
BIND( REPLACE(STR(?uri),'[^#/]+$','') AS ?prefix)
BIND( REPLACE(STR(?uri),'^.*[#/]','') AS ?localName)
# filter by ontology (otherwise timeout expected)
FILTER(?prefix = "http://schema.org/")
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
} ORDER BY ?prefix ?localName
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results)
Main Topic of Q ID
sparql.setQuery("""PREFIX schema: <http://schema.org/>
SELECT ?item ?itemLabel ?mainTopic ?mainTopicLabel WHERE {
VALUES ?item { wd:Q13307732
wd:Q8953981
wd:Q1458376
wd:Q8953071
}
?mainTopic wdt:P910 ?item.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results)
DBpedia Queries
DBpedia is a bit trickier to deal with.
Again, we set up our major call, to be followed by a series of SPARQL queries to DBpedia:
from SPARQLWrapper import SPARQLWrapper, RDFXML
from rdflib import Graph
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
Languages in DBpedia with schema.org Language Code
In this query, we are looking for items that have been already mapped or characterized in a second ontology (schema.org).
sparql.setQuery("""PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX schema: <http://schema.org/>
CONSTRUCT {
?lang a schema:Language ;
schema:alternateName ?iso6391Code .
}
WHERE {
?lang a dbo:Language ;
dbo:iso6391Code ?iso6391Code .
FILTER (STRLEN(?iso6391Code)=2) # to filter out non-valid values
}
""")
sparql.setReturnFormat(RDFXML)
results = sparql.query().convert()
print(results.serialize(format='xml'))
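The serialized RDF/XML can then be post-processed like any other XML. As a standard-library sketch, here we pull the schema:alternateName language codes out of a small hand-written snippet shaped like that CONSTRUCT output:

```python
import xml.etree.ElementTree as ET

# hand-written stand-in shaped like the serialized CONSTRUCT results
rdf_xml = '''<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:schema="http://schema.org/">
  <rdf:Description rdf:about="http://dbpedia.org/resource/English_language">
    <schema:alternateName>en</schema:alternateName>
  </rdf:Description>
</rdf:RDF>'''

root = ET.fromstring(rdf_xml)
codes = [el.text for el in root.iter('{http://schema.org/}alternateName')]
print(codes)          # ['en']
```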
Missing Definitions
sparql.setQuery("""PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX : <http://dbpedia.org/resource/>
SELECT ?item, ?description WHERE {
VALUES ?item { :Child_prostitution
:Ice_Hockey_World_Championships
:Major_League_Soccer
:Tamil_language
:Acne }
?item rdfs:comment ?description .
FILTER ( LANG(?description) = "en" )
}
""")
sparql.setReturnFormat(RDFXML)
results = sparql.query().convert()
print(results.serialize(format='xml'))
Get URIs from Aliases
sparql.setQuery("""PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?x ?redirectsTo WHERE {
VALUES ?wikipedia { "Abies"@en
"Abolitionists"@en}
?x rdfs:label ?wikipedia .
?x dbo:wikiPageRedirects ?redirectsTo
}
""")
sparql.setReturnFormat(RDFXML)
results = sparql.query().convert()
print(results.serialize(format='xml'))
Of course, SPARQL is a language unto itself, and it takes time to become fluent. The examples above are closer to baby-talk than Shakespearean speech. Nonetheless, one begins to gain a feel for the power of the language.
As we move forward, we will try to leverage SPARQL as the query language to our knowledge graph, since it provides the most powerful and flexible language for doing so. There will obviously be times when direct Python calls are simpler and shorter to implement. But the most flexible filters and intersections will come from our use of SPARQL.
Query Resources
A partial, but useful, list of public SPARQL endpoints is provided by:
An assessment of their current availability is provided by:
Here are the top 100 named graphs available with their triple counts:
Wikidata provides its own listing of 100 SPARQL endpoints:
There is an excellent (and growing) compilation of useful SPARQL queries to Wikidata available from:
Two smaller, but similarly useful, resources for DBpedia queries are available from:
- https://aifb-ls3-kos.aifb.kit.edu/projects/spartiqulator/examples.htm
- https://www.cambridgesemantics.com/blog/semantic-university/learn-sparql/sparql-nuts-bolts/
The latter also provides some SPARQL construction tips.
Example OpenStreetMap SPARQL queries are available from: