Finally Getting a Remote SPARQL Working Instance
Yesterday’s installment of Cooking with Python and KBpedia presented the first part of this two-part series on developing a SPARQL endpoint for KBpedia on a remote server. This concluding part picks up with step #7 in the stepwise approach I took to complete this task.
At the outset I thought it would progress rapidly: After all, is not SPARQL a proven query language with central importance to knowledge graphs? But, possibly because our focus in the series is Python, or perhaps for other reasons, I have found a dearth of examples to follow regarding setting up a Python SPARQL endpoint (there are some resources available related to REST APIs).
The first six steps in yesterday’s installment covered getting our environment set up on the remote Linux server, including installing the Web framework Flask and creating a virtual environment. We also presented the Web page form design and template for our SPARQL query form. This second part covers the steps of tying this template form into actual endpoint code, which proved to be simple in presentation but exceedingly difficult to formulate and debug. Once this working endpoint is in hand, I next cover the steps of giving the site an external URL address, starting and stopping the service on the server, and packaging the code for GitHub distribution. I conclude this two-part series with some lessons learned, some comments on the use and relevance of linked data, and pointers to additional documentation.
Step-wise Approach (con’t)
We pick up our step-wise approach here.
7. Tie SPARQL Form to Local Instance
So far, we have a local instance that works from the command line and an empty SPARQL form. We need to relate these two pieces together. In the last installment, I noted two SPARQL-related efforts, pyLDAPI (and its GNAF example) and adhs. I could not find working examples for either, but I did consult their code frequently while testing various options.
Thus, unlike many areas throughout this CWPK series, I really had no working examples from which to build or modify our current SPARQL endpoint needs. While the related efforts above and other examples could provide single functions or small snippets, possibly useful as guidance or after some modification, it was pretty clear I was going to need to build up the code step-by-step, in a similar stepwise manner to what I was following for the entire endpoint. Fortunately, as described in step #6, I did have a starting point for the Web page template using the GNAF example.
From a code standpoint, the first area we need to address is to convert our example start-up stub, what was called test_sparql.py in the CWPK #58 installment, to our main application for this endpoint. We choose to call it cowpoke-endpoint.py in keeping with its role. We will build on the earlier stub by adding most of our import and Flask-routing (‘@app.route("/")’, for example) statements, as well as the initialization code for the endpoint functions. We will call out some specific aspects of this file as we build it.
The second coding area we need to address is how to tie the text areas in our Web form template to the actual Python code. We will put some of that code in the template and some of that code in cowpoke-endpoint.py, which governs getting and processing the SPARQL query. StackOverflow offers a useful pattern for relating templates to Python code via the text entered into a text area. Here is the code example that should be put within the governing template, using the important {{ url_for('submit') }}:
<form action="{{ url_for('submit') }}" method="post">
    <textarea name="text"></textarea>
    <input type="submit">
</form>
and here is the matching code that needs to go into the governing Python file:
from flask import Flask, request, render_template

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('form.html')

@app.route('/submit', methods=['POST'])
def submit():
    return 'You entered: {}'.format(request.form['text'])
Note that file names, form names and routes all need to be properly identified and matched. Also note that imports need to be complete. Further notice in the file listing below that we modify the return statement. We also repeat this form related to the SPARQL results text area.
One of the challenging needs in the code development was working with a remote instance, as opposed to local code. I was also now dealing with a Linux environment, not my local Windows one. After much trial-and-error, which I’m sure is quite familiar to professional developers working in a client-server framework, I learned some valuable (essential!) lessons:
- First, with my miniconda approach and its minimal starting Python basis, I needed to check every new package import required by the code and check whether it was already in the remote instance. The conda list command is important here to first check whether the package is already in the Python environment or not. If not, I would need to find the proper repository for the package and install it per the instructions in CWPK #58
- I needed to make sure that the permission (Linux chmod) and ownership (Linux chown) settings were properly set on the target directories for the remote instance such that I could use my SSH-based file transfer program (WinSCP in my case; Filezilla is another leading option). I simply do not do enough Linux work to be comfortable with remote editors. SSH transfer would enable me to work on the developing code in my local Windows instance
- I needed to get basic templates working early, since I needed Web page targets for where the outputs or traces of the running code would display
- I needed to restart the Apache2 server whenever there was a new code update to be tested. This resulted in a fairly set workflow of edit → upload → re-start Apache → call up remote Web template form (e.g., http://xx.xxx.xxx.xxx/sparql) → inspect trace or logs → rinse and repeat
- Be attentive to and properly set content types, since we are moving data and results from Web forms to code and back again. Content header information can be tricky, and one needs to use cURL or wget (or Postman, which is often referenced, but I did not use). One way to inspect headers and content types is in the output Web page templates, using this code:
    req = request.form
    print(req)
- In HTML forms, use the &lt; code for the left angle bracket symbol (used in SPARQL queries to denote a URI link), otherwise the link will not display on the Web page since this character is reserved
- Use the standard W3C validator when needing to check encodings and Web addresses
- Be extremely attentive to the use of tabs v white spaces in your Python code. Get in the habit of using spaces only, and not tabbing for indents. Editors are more forgiving in a Windows development environment; Linux ones are not.
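The angle-bracket lesson in the list above can also be handled in code rather than by hand. Here is a minimal sketch, assuming the query text is prepared in Python before it is written into a page; the helper name escape_for_html and the sample query are illustrative only and are not part of the endpoint code:

```python
import html

def escape_for_html(query: str) -> str:
    """Escape the characters HTML reserves so a SPARQL query displays intact."""
    # quote=False leaves quotation marks alone; only &, < and > are replaced.
    return html.escape(query, quote=False)

# Illustrative query text (an assumption, not taken from the endpoint).
sample_query = "SELECT ?s WHERE { ?s a <http://www.w3.org/2002/07/owl#Class> } LIMIT 5"
escaped = escape_for_html(sample_query)
print(escaped)
```

The standard library's html.unescape reverses the operation, so the original query text can be recovered before it is passed to a query function.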
The reason I began assembling these lessons arose from the frustrations I had in early code development. Since I was getting small pieces of the functionality running directly in Python from the command line, some of which is shown in the prior two installments, my initial failures to import these routines in a code file (*.py) and get them to work had me pulling my hair out. I simply could not understand why routines that worked directly from the command line did not work once embedded into a code file.
One discovery is that Flask does not play well with the Python list command. If one inspects prior SPARQL examples in this series (for example, CWPK #25), one can see that this construct is common with the standard query code. One adjustment, therefore, was to remove the list generator, and install a looping function for the query output. This applied to both RDFLib and owlready2.
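The switch from list to an explicit loop can be sketched as follows. This is only an illustration under stated assumptions: plain tuples stand in for real RDFLib or owlready2 result rows, and the helper name rows_to_text is hypothetical.

```python
def rows_to_text(rows) -> str:
    """Join query result rows into one newline-separated string."""
    # Iterate over the rows explicitly instead of wrapping them in list(),
    # so that the view ends up with text it can hand back to the browser.
    return "\n".join(str(row) for row in rows)

# Stand-in rows: plain tuples in place of real query results (an assumption).
sample_rows = [
    ("http://kbpedia.org/kko/rc/Mammal", "Mammal"),
    ("http://kbpedia.org/kko/rc/Bird", "Bird"),
]
text = rows_to_text(sample_rows)
print(text)
```

Any iterable of rows can be passed in, so the same helper would serve both query styles if their results can be looped over.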
Besides the lessons presented above, some of the hypotheses I tested to get things to work included the use of CDATA (which only applies to XML), pasting to or saving and retrieving from intermediate text files, changing content-type or mimetype, treatment of the Python multi-line convention ("""), possible use of JavaScript, and more. Probably the major issue I needed to overcome was turning on space and tab display in my local editor to remove their mixed use. This experience really brought home to me the fundamental adherence to indentation in the Python language.
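Mixed tabs and spaces can also be caught with a small script before a file is uploaded. The following is a minimal sketch, not a tool used in this series; the function name find_tab_indented_lines and the sample source string are assumptions for illustration.

```python
def find_tab_indented_lines(source: str) -> list:
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Leading whitespace is whatever precedes the first non-blank character.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            flagged.append(number)
    return flagged

# Illustrative source text: the third line is indented with a tab.
sample_source = "def f():\n    x = 1\n\treturn x\n"
print(find_tab_indented_lines(sample_source))
```

Python also ships a tabnanny module in its standard library (run as python -m tabnanny followed by a file name) that reports ambiguous indentation.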
Nonetheless, by following these guidelines and with eventual multiple tries, I was finally able to get a basic code block working, as documented under the next step.
8. Create and validate an external SPARQL query using SPARQLwrapper to this endpoint.
Since the approach that worked above got closer to the standard RDFLib approach, I decided to expand the query form to allow for external searches as well. Besides modifications to the Web page template, the use of external sources also invokes the SPARQLwrapper extension to RDFLib. Though its results presentation is a bit different, and we now have a requirement to also input and retrieve the URL of the external SPARQL endpoint, we were able to add this capability fairly easily.
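To illustrate how the results presentation differs for external endpoints, here is a minimal sketch that flattens a result document in the W3C SPARQL JSON results layout (head.vars plus results.bindings) into plain rows. It assumes the result has already been decoded into a Python dictionary; the sample document is hand-written rather than live data, and bindings_to_rows is a hypothetical helper name.

```python
import json

def bindings_to_rows(result: dict) -> list:
    """Flatten a SPARQL JSON result document into a list of value tuples."""
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        # A variable may be unbound in a given solution; use None in that case.
        rows.append(tuple(
            binding[var]["value"] if var in binding else None
            for var in variables
        ))
    return rows

# Hand-written sample in the SPARQL JSON results layout (not live data).
sample = json.loads("""
{
  "head": {"vars": ["s", "label"]},
  "results": {"bindings": [
    {"s": {"type": "uri", "value": "http://kbpedia.org/kko/rc/Mammal"},
     "label": {"type": "literal", "value": "Mammal"}},
    {"s": {"type": "uri", "value": "http://kbpedia.org/kko/rc/Bird"}}
  ]}
}
""")
rows = bindings_to_rows(sample)
print(rows)
```

A dictionary like this can be turned back into display text with json.dumps(sample, indent=2) when plain-text output is wanted.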
The resulting code is actually quite simple, though the path to get there was anything but. I present below the eventual code file so developed, with code notes following the listing. You will see that, aside from the Flask code conventions and decorators, our code file is quite similar to others developed throughout cowpoke:
from flask import Flask, Response, request, render_template         # Note 1
from owlready2 import *
import rdflib
from rdflib import Graph
import json
from SPARQLWrapper import SPARQLWrapper, JSON, XML

# load knowledge graph files
main = '/var/data/kbpedia/kbpedia_reference_concepts.owl'          # Note 2
skos_file = 'http://www.w3.org/2004/02/skos/core'
kko_file = '/var/data/kbpedia/kko.owl'

# set up scopes and namespaces
world = World()                                                     # Note 2
kb = world.get_ontology(main).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')
skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
graph = world.as_rdflib_graph()

# set up Flask microservice
app = Flask(__name__)                                               # Note 3

@app.route("/")
def sparql_form():
    return render_template('sparql_page.html')

# set up route for submitting query, receiving results
@app.route('/submit', methods=['POST'])                             # Note 4
def submit():
    # if request.method == 'POST':
    q_submit = None
    results = ''
    if request.form['q_submit'] is None or len(request.form['q_submit']) < 5:
        return Response(
            'Your request to the SPARQL endpoint must contain a \'query\'.',
            mimetype='text/plain'
        )
    else:
        data = request.form['q_submit']                             # Note 5
        source = request.form['selectSource']
        format = request.form['selectFormat']
        q_url = request.values.get('q_url')
        try:                                                        # Note 6
            if source == 'kbpedia' and format == 'owlready':        # Note 7
                q_query = graph.query_owlready(data)                # Note 8
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = results.replace(']', ']\n')
            elif source == 'kbpedia' and format == 'rdflib':
                q_query = graph.query(data)                         # Note 8
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = results.replace('))', '))\n')
            elif source == 'kbpedia' and format == 'xml':
                q_query = graph.query(data)
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = q_query.serialize(format='xml')
                results = str(results)
                results = results.replace('<result>', '\n<result>')
            elif source == 'kbpedia' and format == 'json':
                q_query = graph.query(data)
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = q_query.serialize(format='json')
                results = str(results)
                results = results.replace('}}, ', '}}, \n')
            elif source == 'kbpedia' and format == 'html':          # Note 9
                q_query = graph.query(data)
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = q_query.serialize(format='csv')
                results = str(results)
                results = results.readlines()
                # table = '<html><table>'
                for row in results:
                    # row = str(row)
                    result = row[0]
                    # row = row.replace('\r\n', '')
                    # row = row.replace(',', '</td><td>')
                    # table += '<tr><td>' + row + '</td></tr>' + '\n'
                # table += '</table><br></html>'
                # results = table
                return result
            elif source == 'kbpedia' and format == 'txt':
                q_query = graph.query(data)
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = q_query.serialize(format='txt')
            elif source == 'kbpedia' and format == 'csv':
                q_query = graph.query(data)
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = q_query.serialize(format='csv')
            elif source == 'external' and format == 'rdfxml':
                q_url = str(q_url)
                results = q_url
            elif source == 'external' and format == 'xml':
                sparql = SPARQLWrapper(q_url)
                data = data.replace('\r', '')
                sparql.setQuery(data)
                results = sparql.query()
            elif source == 'external' and format == 'json':         # Note 10
                sparql = SPARQLWrapper(q_url)
                data = data.replace('\r', '')
                # data = data.replace("\n", "\n' + '")
                # data = '"' + data + '"'
                sparql.setQuery(data)                               # Note 10
                sparql.setReturnFormat(JSON)
                results = sparql.queryAndConvert()
                # q_sparql = str(sparql)
                # results = q_sparql
            else:                                                   # Note 11
                results = ('This combination of Source + Format is not available. Here are the possible combinations:\n\n' +
                           ' Kbpedia: owlready2: Formats: owlready2\n' +
                           ' rdflib: rdflib\n' +
                           ' xml\n' +
                           ' json\n' +
                           ' *html\n' +
                           ' text\n' +
                           ' csv\n' +
                           ' External: as entered: rdf/xml\n' +
                           ' *json\n\n' +
                           ' * combo still buggy')
            if format == 'html':
                return Response(results, mimetype='text/html')      # Note 9, 12
            else:
                return Response(results, mimetype='text/plain')
        except Exception as e:                                      # Note 6
            return Response(
                'Error(s) found in query: ' + str(e),
                mimetype='text/plain'
            )

if __name__ == "__main__":
    app.run(debug=True)
Here are some annotation notes related to this code, as keyed by note number above:
1. There are many specific packages needed for this SPARQL application, as discussed in the main text. The major point to make here is that each of these packages needs to be loaded into the remote virtual environment, per the discussion in CWPK #58
2. Like other cowpoke modules, these are pretty standard calls to the needed knowledge graphs and configuration settings
3. These are the standard Flask calls, as discussed in the prior installment
4. The main routine for the application is located here. We could have chosen to break this routine into multiple files and templates, but since this application is rather straightforward, we have placed all functionality into this one function block
5. These are the calls that bring the assignments from the actual Web page (template) into the application
6. We set up a standard try . . . except block, which allows an error, if it occurs, to exit gracefully with a possible error explanation
7. We set up all execution options as a two-part condition. One part is whether the source is the internal KBpedia knowledge graph (which may use either the standard rdflib or owlready2 methods) or is external (which uses the sparqlwrapper method). The second part is which of eight format options might be used for the output, though not all are available to the source options; see further Note 11. Also, most of the routines have some minor code to display results line-by-line
8. Here is where the graph query function differs by whether RDFLib or owlready2 is used
9. As of the time of release of this installment, I am still getting errors in this HTML output routine. I welcome any suggestions for working code here
10. As of the time of release of this installment, I am still getting errors in this JSON output routine. I have tried the standard SPARQLwrapper code, SPARQLwrapper2, and adding the JSON format to the initial sparql, all to no avail. It appears there may be some character or encoding issue in moving the query on the Web form to the function. The error also appears to occur in the line indicated. I welcome any suggestions for working code here
11. This is where any of the two-part combos discussed in Note #7 that do not work get captured
12. This if . . . else enables the HTML output option.
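For the HTML output routine flagged in Note 9, one possible direction is to build the table from the CSV serialization using only the standard library. This is a sketch under assumptions, not tested against the live endpoint: it assumes the serialized CSV is available as either text or UTF-8 bytes, and the helper name csv_to_html_table and the sample data are hypothetical. One thing that is certain is that a Python str has no readlines method, which may be one source of the error.

```python
import csv
import html
import io

def csv_to_html_table(csv_data) -> str:
    """Render CSV text (str or UTF-8 bytes) as a simple HTML table."""
    if isinstance(csv_data, bytes):
        # Decode rather than calling str(), which would yield "b'...'" text.
        csv_data = csv_data.decode("utf-8")
    reader = csv.reader(io.StringIO(csv_data, newline=""))
    lines = ["<table>"]
    for row in reader:
        # Escape each cell so URIs and literals cannot break the markup.
        cells = "".join("<td>" + html.escape(cell) + "</td>" for cell in row)
        lines.append("<tr>" + cells + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)

# Hand-written stand-in for serialized CSV results (not real KBpedia output).
sample_csv = b"s,label\r\nhttp://kbpedia.org/kko/rc/Mammal,Mammal\r\n"
table = csv_to_html_table(sample_csv)
print(table)
```

The returned string could then be passed to Response with a text/html mimetype, as the final if . . . else in the listing already does for the HTML option.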
9. Set up an external URI to the localhost instance
With this working code instance now in place, it was time to expose the service through a standard external URI. (During development we used http://xx.xxx.xxx.xxx/sparql). The URL we chose for the service is http://sparql.kbpedia.org/.
We first needed to set up a subdomain pointing to the service via our DNS provider. While we generally provide SSL support for all of our Web sites (the secure protocol behind the https: Web prefix), we decided the minor use of this SPARQL site did not warrant keeping the certificates enabled and current. So, this site is configured for http: alone.
We first configured our Flask sites as described in CWPK #58. To get this site working under the new URL, I only needed to make two changes to the earlier configuration. This configuration file is 000-default.conf and is found on my server in the /etc/apache2/sites-enabled directory. Here are the two changes, called out by note:
<VirtualHost *:80>
    # Note 1
    ServerName sparql.kbpedia.org
    ServerAdmin mike@mkbergman.com
    DocumentRoot /var/www/html

    WSGIDaemonProcess sparql python-path=/usr/bin/python-projects/miniconda3/envs/sparql/lib/python3.8/site-packages
    # Note 2
    WSGIScriptAlias / /var/www/html/sparql/wsgi.py
    <Directory /var/www/html/sparql>
        WSGIProcessGroup sparql
        WSGIApplicationGroup %{GLOBAL}
        Order deny,allow
        Allow from all
    </Directory>

    ErrorLog ${APACHE_LOG_DIR}/error.log
    CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>
The first change was to add the new domain sparql.kbpedia.org under ServerName. The second change was to replace the /sparql alias with / under the WSGIScriptAlias directive.
10. Set up an automatic start/re-start cron job.
The last step under our endpoint process is to schedule a cron job on the remote server to start up the sparql virtual environment in the case of an unintended shut down or breaking of the Web site. This last task means we can let the endpoint run virtually unattended. First, let’s look at how simple a re-activation (xxx) script may look:
#!/bin/sh
conda activate sparql
Note the standard bash script header on this file. Also note our standard activation statement. One can create this file and then place it in a logical, findable location. In our instance, we will put it where the other sparql scripts reside, namely in /var/www/html/sparql/.
We next need to make sure this script is executable by our cron job. So we navigate to the directory where this bash script is located and change its permissions:
chmod +x re_activate.sh
Once these items are set, we are now able to add this bash script to our scheduled cron jobs. We find this specification and invoke our editor of that file by using:
nano /etc/crontab
Using the nano editor conventions (or those of your favored editor), we can now add our new cron job as a new entry alongside the existing asterisk (*) specifications:
30 * * * * /bin/sh /var/www/html/sparql/re_activate.sh
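To make the schedule fields in the entry above concrete, here is a deliberately simplified sketch of how a five-field cron specification (minute, hour, day of month, month, day of week) can be matched against a point in time. It supports only the asterisk and plain integers; real cron also has ranges, lists, steps and special day-of-month versus day-of-week handling that are not modeled here. The function names are hypothetical.

```python
from datetime import datetime

def cron_field_matches(field: str, value: int) -> bool:
    """Match one cron field; only '*' and plain integers are supported here."""
    return field == "*" or int(field) == value

def cron_matches(spec: str, moment: datetime) -> bool:
    """Check whether a five-field cron schedule fires at the given minute."""
    minute, hour, day, month, weekday = spec.split()
    # cron counts Sunday as 0; isoweekday() counts Sunday as 7, hence the modulo.
    cron_weekday = moment.isoweekday() % 7
    return (
        cron_field_matches(minute, moment.minute)
        and cron_field_matches(hour, moment.hour)
        and cron_field_matches(day, moment.day)
        and cron_field_matches(month, moment.month)
        and cron_field_matches(weekday, cron_weekday)
    )

schedule = "30 * * * *"
print(cron_matches(schedule, datetime(2020, 11, 5, 14, 30)))
print(cron_matches(schedule, datetime(2020, 11, 5, 14, 31)))
```

Read this way, the schedule 30 * * * * fires once an hour, at minute 30 of every hour on every day.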
We have now completed all of our desired development steps for the KBpedia SPARQL endpoint. As of the release of today’s installment, the site is active.
Endpoint Packaging
I will package up this code as a separate project and repository on GitHub per the steps outlined in CWPK #46 under the MIT license, same as cowpoke. Since there are only a few files, we did not create a formal pip package. Here will be the package address:
https://github.com/Cognonto/cowpoke-endpoint
Linked Data and Why Not Employed
My original plan was to have this SPARQL site offer linked data. Linked data is where the requesting user agent may be served either semantic data such as RDF in various formats or standard HTML if the requester is a browser. It is a useful approach for the semantic Web and has a series of requirements to qualify as ‘5-star’ linked open data.
From a technical standpoint, the nature of the requesting user agent is determined by the local Web server (Apache2 in our case), which then routes the request to produce either structured data or semi-structured HTML for displaying in a Web page through a process known as content negotiation (the term is sometimes shortened to ‘conneg’). In this manner, our item of interest can be denoted with a single URI, but the content version that gets served to the requesting agent may differ based on the nature of the agent or its request. In a data-oriented setting, for example, the requested data may be served up in a choice of formats to make it easier to consume on the receiving end.
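The idea behind content negotiation can be sketched in a few lines of Python. This is an illustration of the principle only, not how Apache2 implements it: it assumes the choice is driven by the HTTP Accept header, ignores wildcards such as */*, and uses a hypothetical list of offered formats and hypothetical function names.

```python
def parse_accept(header: str) -> list:
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    parsed = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0]
        if not media_type:
            continue
        quality = 1.0  # q defaults to 1 when the header omits it
        for param in pieces[1:]:
            if param.startswith("q="):
                quality = float(param[2:])
        parsed.append((media_type, quality))
    # sorted() is stable, so equal-q entries keep their header order.
    return sorted(parsed, key=lambda item: item[1], reverse=True)

def choose_format(header: str, available: list, default: str) -> str:
    """Pick the first acceptable media type the server can produce."""
    for media_type, quality in parse_accept(header):
        if quality > 0 and media_type in available:
            return media_type
    return default

# Hypothetical set of representations a linked data server might offer.
OFFERED = ["text/html", "text/turtle", "application/rdf+xml"]

browser = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
agent = "text/turtle;q=0.9,application/rdf+xml;q=0.5"
print(choose_format(browser, OFFERED, "text/html"))
print(choose_format(agent, OFFERED, "text/html"))
```

With these sample headers, the browser-style request resolves to HTML while the data-oriented agent resolves to Turtle, even though both would be asking for the same URI.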
As I noted in my initial investigations regarding Python (CWPK #58), there are not many options compared to other languages such as Java or JavaScript. One of the reasons I initially targeted pyLDAPI was that it promised to provide linked data. (RDFLib-web used to provide an option, but it is no longer maintained and does not work on Python 3.) Unfortunately, I could find no working instances of the pyLDAPI code and, when inspecting the code base itself, I was concerned about the duplicated number of Flask templates required by this approach. Given the number and diversity of classes and properties in KBpedia, my initial review suggested pyLDAPI was not a tenable approach, even if I could figure out how to get the code working.
Given the current state of development, my suggestion is to go with an established triple store with linked data support if one wants to provide linked data. It does not appear that Python has a sufficiently mature option available to make linked data available at acceptable effort.
Lessons and Possible Enhancements
The last section summarized the relatively immature state of Python for SPARQL and linked data purposes. In order to get the limited SPARQL functionality working in this CWPK series, I have kept my efforts limited to the SPARQL ‘SELECT’ statement and have noted many gotchas and workarounds in the discussions over this and the prior two installments. Here are some additional lessons not already documented:
- Flask apparently does not like ‘return None’
- Our minimal conda installation can cause problems with ‘standard’ Python packages dropped from the miniconda3 distro. One of these is json, which I ultimately needed to obtain from conda install -c jmcmurray json.
Clearly, some corners were cut above and some aspects ignored. If one wanted to fully commercialize a Python endpoint for SPARQL based on the efforts in this and the previous CWPK installments, here are some useful additions:
- Add the full suite of SPARQL commands to the endpoint (e.g., CONSTRUCT, ASK, DESCRIBE, plus other nuances)
- Expand the number of output formats
- Add further error trapping and feedback for poorly-formed queries, and
- Make it easier to add linked data content negotiation.
Of course, these enhancements do not include more visual or user-interface assists for creating SPARQL queries in the first place. These are useful efforts in their own right.
End of Part V
This installment marks the end of our Part V: Mapping, Stats, and Other Tools. We begin Part VI next week governing natural language applications and machine learning involving KBpedia. We are also now 80% of the way through our entire CWPK series.
Additional Documentation
Here are related documents, some of which extend the uses discussed herein:
Flask Resources
- https://hackersandslackers.com/flask-routes/
- https://www.digitalocean.com/community/tutorials/how-to-structure-large-flask-applications
- https://www.digitalocean.com/community/tutorials/processing-incoming-request-data-in-flask
RDFLib
- https://rdflib.readthedocs.io/en/stable/intro_to_sparql.html
- https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.plugins.sparql.html
- https://rdflib.readthedocs.io/en/stable/modules/rdflib/query.html
- https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.plugins.stores.html#module-rdflib.plugins.stores.sparqlconnector
- https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.extras.html#module-rdflib.extras.infixowl
- https://rebeccabilbro.github.io/rdflib-and-sparql/
SPARQLwrapper
- https://rdflib.dev/sparqlwrapper/
- https://sparqlwrapper.readthedocs.io/en/latest/main.html
- https://readthedocs.org/projects/sparqlwrapper/downloads/pdf/stable/
- https://pypi.org/project/SPARQLWrapper/
Other
- Using cURL for SPARQL
- RDFLib JSON-LD
- Flask-RDF, a Flask decorator to output RDF using content negotiation
- Cautions about SPARQL endpoints.