Completing the Extraction Methods and Formal Packaging
This last installment in our mini-series on packaging our Cooking with Python and KBpedia project will add one more flexible extraction routine and complete the packaging of the cowpoke extraction module. This completion will set the template for how we will add additional clusters of functionality as we progress through the CWPK series.
The new extraction routine is geared more to later analysis and use of KBpedia than our current extraction routines. The routines we have developed so far are intended for nearly complete bulk extractions of KBpedia. We would like to add more fine-grained ways to extract various ‘slices’ from KBpedia, especially those that may arise from individual typology changes or SPARQL queries. We could benefit from having an intermediate specification for extractions that could be used for directing results from other functions or analyses.
Steps to Package a Project so Far
To summarize, here are the major steps we have discovered so far to transition from prototype code in the notebook to a Python package:
- Generalize the prototype routines in the notebook
- Find the appropriate .. Lib/site-packages directory and define a new directory with your package name, short lowercase best
- Create an __init__.py file in that same directory (and any subsequent sub-package directories should you define the project deeper); add basic statements similar to above
- Create a __main__.py file for your shared functions; it is a toss-up whether shared variables should also go here or in __init__.py
- If you want a single point of editing for changing inputs for a given run via a ‘record’ metaphor, follow something akin to what I began with the config.py
- Create a my_new_functions.py module where the bulk of your new work resides (extraction in our current instance). The approach I found helpful was to NOT wrap my prototype functions in a function definition at first. Only in the next step, once the interpreter is able to step through the new functions without errors, do I then wrap the routines in definitions
- In an interactive environment (such as Jupyter Notebook), start with a clean kernel and try to import myproject (where ‘myproject’ is the current bundle of functionality you are working on). Remember, an import will run the scripts as encountered, and if you have points of failure due to undefined variables or whatever, the traceback on the interpreter will tell you what the problem is and offer some brief diagnostics. Repeat until the new function code processes without error, then wrap in a definition, and move on to the next code block
- As defined functions are built, try to look for generalities of function and specification and desired inputs and outputs. These provide the grist for continued re-factoring of your code
- Document your code, which is only just beginning for this KBpedia project. I’ve been documenting much in the notebook pages, but not yet enough in the code itself.
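To make the steps above a bit more concrete, here is a minimal sketch that lays out the same files and then imports them from a clean start. It is only an illustration: the package name myproject_sketch, the file contents, and the use of a temporary directory in place of Lib/site-packages are all assumptions of mine, not the actual cowpoke set-up.

```python
import importlib
import sys
import tempfile
from pathlib import Path

def make_skeleton(parent, name):
    """Create a bare package directory holding the files named in the steps."""
    pkg = Path(parent) / name
    pkg.mkdir(parents=True, exist_ok=True)
    (pkg / '__init__.py').write_text('# marks this directory as a package\n')
    (pkg / '__main__.py').write_text('# shared start-up functions go here\n')
    (pkg / 'config.py').write_text("extract_deck = {'render': 'r_label'}\n")
    (pkg / 'my_new_functions.py').write_text(
        "def hello():\n    return 'imported without error'\n")
    return pkg

# A temporary directory stands in for Lib/site-packages in this sketch.
parent = tempfile.mkdtemp()
pkg_dir = make_skeleton(parent, 'myproject_sketch')
sys.path.insert(0, parent)

# Importing runs each file top to bottom, which is where errors would surface.
config = importlib.import_module('myproject_sketch.config')
funcs = importlib.import_module('myproject_sketch.my_new_functions')
```

If either import raises, the traceback points at the offending line in the file, which is the fix-and-repeat loop described in the steps.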
Objectives for This Installment
Here is what I want to accomplish and wrap up in this installment focusing on custom extractions. I want to:
- Arbitrarily select one to many classes or properties for driving an extraction
- Define one or multiple starting points for descendants() or one or multiple individual starting points. This entry point is provided by the variable root in our existing extraction routines
- Allow the extract_deck specification to also define the rendering method (see next)
- For iterations, specify input and output file forms, so need: iterator, base + extension logic, relate to existing annotations, etc., as necessary
- Reuse and build from the existing extraction routines, which the points above suggest, and
- Improve the file-based and -named orientation of the routines.
The Render Function
You may recall from the tip in CWPK #29 that owlready2 comes with three rendering methods for its results: 1) a default method that has a short namespace prefix prepended to all classes and properties; 2) a label method where no prefixes are provided; and 3) a full iri method where all three components of the subject-predicate-object (s-p-o) semantic triple are given their complete IRI. If you recall, here are those three function calls:
set_render_func(default_render_func)
set_render_func(render_using_label)
set_render_func(render_using_iri)
To provide these choices, we will add a render method specification in the extract_deck and provide the switch at the top of our extraction routines. We will use the keywords r_default, r_label, and r_iri, matching the three method names above, to specify which of the three rendering options has been chosen.
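As an aside, the same three-way switch can also be written as a dictionary lookup. The sketch below is only an alternative illustration, not the code used in cowpoke. The three render functions are stand-ins I define here so the example runs without owlready2, and pick_renderer is a hypothetical helper name.

```python
# Stand-ins so the sketch runs without owlready2; in cowpoke these names
# would be the render functions handed to set_render_func().
def default_render_func(entity):
    return 'default:' + str(entity)

def render_using_label(entity):
    return 'label:' + str(entity)

def render_using_iri(entity):
    return 'iri:' + str(entity)

# Keyword stored in extract_deck -> render function
RENDERERS = {
    'r_default': default_render_func,
    'r_label': render_using_label,
    'r_iri': render_using_iri,
}

def pick_renderer(extract_deck):
    """Return the render function named by extract_deck['render'], or None."""
    return RENDERERS.get(extract_deck.get('render'))

chosen = pick_renderer({'render': 'r_label'})
```

A None result plays the same role as the else branch in the extraction routines: the keyword was not one of the three allowed values, so the caller should stop.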
A Custom Extractor
The basic realization is that with just a few additions we are able to allow customization of our existing extraction routines. Initially, I thought I would need to write entirely new routines. But, fortunately, apparently our existing routines already are sufficiently general to enable this customization.
Since it is a bit simpler, we will use the struct_extractor function to show where the customization enhancements need to go. We will also provide the code snippet where the insertion is noted. I provide comments on these additions below the code listing.
Note these same changes are applied to the annot_extractor function as well (not shown). You can inspect the updated extraction module at the conclusion of this installment.
OK, so let’s explain these customizations:
def struct_extractor(**extract_deck):
    print('Beginning structure extraction . . .')
    # 1 - render method goes here
    r_default = ''
    r_label = ''
    r_iri = ''
    render = extract_deck.get('render')
    if render == 'r_default':
        set_render_func(default_render_func)
    elif render == 'r_label':
        set_render_func(render_using_label)
    elif render == 'r_iri':
        set_render_func(render_using_iri)
    else:
        print('You have assigned an incorrect render method--execution stopping.')
        return
    # 2 - note about custom extractions
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    descent_type = extract_deck.get('descent_type')
    descent = extract_deck.get('descent')
    single = extract_deck.get('single')
    x = 1
    cur_list = []
    a_set = []
    s_set = []
    new_class = 'owl:Thing'
    # 5 - what gets passed to 'output'
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)
        if loop == 'class_loop':
            header = ['id', 'subClassOf', 'parent']
            p_item = 'rdfs:subClassOf'
        else:
            header = ['id', 'subPropertyOf', 'parent']
            p_item = 'rdfs:subPropertyOf'
        csv_out.writerow(header)
        # 3 - what gets passed to 'loop_list'
        for value in loop_list:
            print(' . . . processing', value)
            root = eval(value)
            # 4 - descendant or single here
            if descent_type == 'descent':
                a_set = root.descendants()
                a_set = set(a_set)
                s_set = a_set.union(s_set)
            elif descent_type == 'single':
                a_set = root
                s_set.append(a_set)
            else:
                print('You have assigned an incorrect descent method--execution stopping.')
                return
        print(' . . . processing consolidated set.')
        for s_item in s_set:
            o_set = s_item.is_a
            for o_item in o_set:
                row_out = (s_item,p_item,o_item)
                csv_out.writerow(row_out)
                if loop == 'class_loop':
                    if s_item not in cur_list:
                        row_out = (s_item,p_item,new_class)
                        csv_out.writerow(row_out)
                cur_list.append(s_item)
                x = x + 1
    print('Total unique IDs written to file:', x)
The notes that follow pertain to the code listing above.
The render method (#1) is just a simple switch set in the configuration file. Only three keyword options are allowed; if a wrong keyword is entered, the error is flagged and the routine ends. We also added a ‘render’ assignment at the top of the code block.
What now makes this routine (#2) a custom one is the use of the configurable custom_dict and its configuration settings. The custom_dict dictionary is specified by assigning to the loop_list (#3). The custom_dict dictionary can take one or many key:value pairs. The first item, the key, should take the name that you wish to use as the internal variable name. The second item, the value, should correspond to the property or class with its namespace prefix. Here are the general rules and options available for a custom extraction:
- You may enter properties OR classes into the custom_dict dictionary, but not both, in your pre-run configurations
- The ‘iri’ switch for the renderer is best suited for the struct_extractor function. It should probably not be used for annotations given the large number of output columns and loss of subsequent readability when using the full IRI. The choice of actual prefix is likely not that important since it is easy to do global search-and-replaces when in bulk mode
- You may retrieve items in the custom_dict dictionary either singly or all of its descendants, depending on the use of the ‘single’ and ‘descent’ keyword options (see #4 next).
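Pulling these rules together, a custom run might be configured along the following lines. This is a sketch under assumptions: the dictionary entries, the output file name, and the check_deck helper are hypothetical illustrations of mine, while the deck keys mirror the extract_deck.get() calls in the listing above.

```python
# Hypothetical config.py entries; values are illustrative only.
custom_dict = {
    'animals': 'kko.Animals',   # key: internal name; value: prefixed class
    'plants': 'kko.Plants',
}

extract_deck = {
    'render': 'r_label',                  # 'r_default', 'r_label' or 'r_iri'
    'loop': 'class_loop',                 # custom_dict holds classes here
    'loop_list': custom_dict.values(),    # classes OR properties, not both
    'descent_type': 'descent',            # or 'single'
    'out_file': 'custom_struct_out.csv',  # hypothetical file name
}

def check_deck(deck):
    """Return the names of switch settings with invalid values (empty if none)."""
    problems = []
    if deck.get('render') not in ('r_default', 'r_label', 'r_iri'):
        problems.append('render')
    if deck.get('descent_type') not in ('descent', 'single'):
        problems.append('descent_type')
    return problems

problems = check_deck(extract_deck)
```

Checking the two switches before the run starts is optional, since the extractor itself stops on a bad keyword, but it avoids opening the output file for a run that cannot complete.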
Item #4 is another switch to either run the entries in the custom_dict dictionary as single ‘roots’ (thus no sub-classes or sub-properties) or with all descendants. The descent_type has been added to the extract_deck settings, plus we added the related assignments to the beginning of this code block.
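The difference between the two settings is easiest to see on a toy hierarchy. The sketch below uses a plain dictionary as a stand-in for the ontology, so the class names and the descendants() and collect() helpers are assumptions for illustration only. In this toy version descendants() includes the root itself, and both modes accumulate into a set so that items reached from more than one root appear only once.

```python
# Toy stand-in for an ontology: parent -> direct children (hypothetical names).
CHILDREN = {
    'Animal': ['Mammal', 'Bird'],
    'Mammal': ['Dog'],
    'Bird': [],
    'Dog': [],
}

def descendants(root):
    """Return root plus everything below it, as a set."""
    found = {root}
    for child in CHILDREN.get(root, []):
        found |= descendants(child)
    return found

def collect(roots, descent_type):
    """Build one consolidated set of items from several starting points."""
    s_set = set()
    for root in roots:
        if descent_type == 'descent':
            s_set |= descendants(root)   # root and all of its sub-items
        elif descent_type == 'single':
            s_set.add(root)              # the root alone
        else:
            raise ValueError('descent_type must be "descent" or "single"')
    return s_set

singles = collect(['Mammal', 'Bird'], 'single')
everything = collect(['Mammal', 'Animal'], 'descent')
```

Here 'Mammal' and 'Dog' are reachable from both roots in the second call, yet each lands in the consolidated set only once.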
The last generalized capability we wanted to capture was the ability to print out all of the structural aspects of KBpedia’s typologies, which suggested some code changes at roughly #5 above. While I am sure I could have figured out a way to do this, because of interactions with the other customizations this addition proved to be more complicated than warranted. So, rather than spend undue time trying to cram everything into a single, generic function (struct_extractor), I decided the easier and quicker choice was to create its own function, picking up on many of the processing constructs developed for the other extractor routines.
Basically, what we want in a typology extract is:
- Separate extractions of individual typologies to their own named files
- Removal of the need to find unique resources across multiple typologies. Rather, the intent is to capture the full scope of structural (subClassOf) aspects in each typology
- A design that enables us to load a typology as an individual ontology or knowledge graph into a tool such as Protégé.
By focusing on a special extractor limited to classes, typologies, structure, and single output files per typology, we were able to make the function rather quickly and simply. Here is the result, the typol_extractor:
def typol_extractor(**extract_deck):
    print('Beginning structure extraction . . .')
    r_default = ''
    r_label = ''
    r_iri = ''
    render = extract_deck.get('render')
    if render == 'r_default':
        set_render_func(default_render_func)
    elif render == 'r_label':
        set_render_func(render_using_label)
    elif render == 'r_iri':
        set_render_func(render_using_iri)
    else:
        print('You have assigned an incorrect render method--execution stopping.')
        return
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    class_loop = extract_deck.get('class_loop')
    base = extract_deck.get('base')
    ext = extract_deck.get('ext')
    new_class = 'owl:Thing'
    # compare string values with !=, not identity ('is not')
    if loop != 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    header = ['id', 'subClassOf', 'parent']
    p_item = 'rdfs:subClassOf'
    for value in loop_list:
        print(' . . . processing', value)
        x = 1
        s_set = []
        cur_list = []
        root = eval(value)
        s_set = root.descendants()
        # build the output file name from the typology name
        frag = value.replace('kko.','')
        out_file = (base + frag + ext)
        # the 'with' block closes the file when it ends
        with open(out_file, mode='w', encoding='utf8', newline='') as output:
            csv_out = csv.writer(output)
            csv_out.writerow(header)
            for s_item in s_set:
                o_set = s_item.is_a
                for o_item in o_set:
                    row_out = (s_item,p_item,o_item)
                    csv_out.writerow(row_out)
                    if s_item not in cur_list:
                        row_out = (s_item,p_item,new_class)
                        csv_out.writerow(row_out)
                    cur_list.append(s_item)
                    x = x + 1
        print('Total unique IDs written to file:', x)
Two absolute essentials for this routine are to set the 'loop' key to 'class_loop' and to set the 'loop_list' key to typol_dict.values().
Note the code in the middle of the routine that creates the file name after replacing (removing) the ‘kko.’ prefix from the value name in the dictionary. We also needed to add two further entries to the extract_deck dictionary.
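For illustration, here is how the settings and the file naming fit together. Treat it as a sketch only: the two typol_dict entries, the 'typol_out/' directory prefix, and the out_file_for helper are assumptions of mine, and I take the two further deck entries to be the 'base' and 'ext' keys read at the top of typol_extractor.

```python
# Illustrative values only; the real typol_dict and paths live in config.py.
typol_dict = {
    'Animals': 'kko.Animals',
    'Plants': 'kko.Plants',
}

extract_deck = {
    'render': 'r_label',
    'loop': 'class_loop',               # required by typol_extractor
    'loop_list': typol_dict.values(),   # one output file per value
    'base': 'typol_out/',               # hypothetical output directory prefix
    'ext': '.csv',
}

def out_file_for(value, base, ext):
    """Strip the 'kko.' prefix and wrap the remainder in base and ext."""
    frag = value.replace('kko.', '')
    return base + frag + ext

files = [out_file_for(v, extract_deck['base'], extract_deck['ext'])
         for v in extract_deck['loop_list']]
```

Each typology thus ends up in its own file named after the typology itself, which is what lets it be loaded on its own later.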
Your local file structure likely differs from what we set up for this project, but if it is similar, the following commands can be used to run these routines. Should you test different possibilities, make sure your input specifications in the extract_deck are modified appropriately. Remember to always work from copies so that you may restore critical files in the case of an inadvertent overwrite.
Here are the commands:
from cowpoke.__main__ import *
from cowpoke.config import *
import cowpoke
import owlready2
cowpoke.typol_extractor(**cowpoke.extract_deck)
The extract.py File
Again, assuming you have set up your files and directories similar to what we have suggested, you can inspect the resulting extractor code in this new module (modify the path as necessary):
with open(r'C:\1-PythonProjects\Python\Lib\site-packages\cowpoke\extract.py', 'r') as f:
    print(f.read())
Summary of the Module
OK, so we are now done with the development and packaging of the extractor module for cowpoke. Our efforts resulted in the addition of four files under the ‘cowpoke’ directory. These files are:
- The __init__.py file that indicates the cowpoke package
- The __main__.py file where shared start-up functions reside
- The config.py file where we store our dictionaries and where we specify new run settings in the special extract_deck dictionary, and
- The extract.py module where all of our extraction routines are housed.
This module is supported by three dictionaries (and the fourth special one for the run configurations):
- The typol_dict dictionary of typologies
- The prop_dict dictionary of top-level property roots
- The custom_dict dictionary for tailored starting point extractions, and
- The extract_deck special dictionary for extraction run settings.
In turn, most of these dictionaries can also be matched with three different extractor routines or functions:
- The annot_extractor function for extracting annotations
- The struct_extractor function for extracting the is-a relations in KBpedia, and
- The typol_extractor dedicated function for extracting out the individual typologies into individual files.
In our next CWPK installment we will discuss how we might manipulate this extracted information in a bulk manner using spreadsheets and other tools. These same extracted files, perhaps after bulk manipulations or other edits and changes, will then form the basis for the input files that we will use to build new versions of KBpedia (or your own extensions and changes to it) from scratch. We are now half-way around our roundtrip.
I think there was a mistake in the cowpoke typology extractor code. In particular, should run_deck.get(‘render’) be actually extractor_deck.get(‘render’)?
Hi Varun,
Yes, good catch, you are correct. My first version lumped both build and extract configurations under the ‘run_deck’ function, but I found it made sense to split them. I have updated the post and the *.ipynb file.
Thanks, Mike