Generalization, Packaging, and Complexity Compel More Powerful Tools
Over the past installments of this Cooking with Python and KBpedia series, we have been building up larger and more complex routines from our coding blocks. This approach has been great for learning and prototyping, but does not readily support building maintainable applications. This is a natural evolution in any code development that is moving towards real use and deployment. It is a step that Project Jupyter is also taking in its efforts to transition from Notebook to JupyterLab (see here). Their intent is to provide a complete code development environment as well as one suitable for interactive notebooks.
Recent announcements aside, we picked the Spyder IDE and installed it in CWPK #11 for these same functional reasons, and will stay with it throughout this series because of its maturity and degree of acceptance within the data science community. But, JupyterLab looks to be a promising development.
Whatever the tool, there comes a time when code proliferation and the need to manage it to a release condition warrants moving beyond prototyping. Now is that time with our project.
We will use the packaging of our extraction routines begun in the last installment as our example case for how to proceed. We will continue to use Jupyter Notebook to discuss and present code snippets, but that material is now to be backed up with methods, code files, and modules, hopefully in an acceptable Python way. We will be using Spyder for these development purposes and referring to it in our documentation with screen captures and discussion as appropriate. We will also be releasing Python files as our installments proceed. But the transition to working code is more complicated than changing tool emphasis alone.
An obsession for many programmers, and not a bad one by the way, is to embrace a DRY (don’t repeat yourself) mindset that seeks to reduce duplicative patterns and to find generalities within code. If properly done, DRY leads to code that is easier to maintain and understand. It also increases inter-dependencies and places a premium on the architecture and modularization (the packaging) of the code base. Definitions of functions and methods and their organization are part of this. By no means do I have the experience and background to offer advice in these areas, other than to try myself to identify and generalize repeatable patterns. With these caveats in mind, let’s proceed to package some code.
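As a toy sketch of what DRY means in practice (the function names here are made up for illustration, and are not part of our code base), the idea is to collapse near-duplicate routines into one generalized routine:

# two near-duplicate extractors, each repeating the same loop pattern . . .
def get_labels(items):
    return [str(i.label) for i in items]

def get_definitions(items):
    return [str(i.definition) for i in items]

# . . . collapsed into a single generalized function
def get_annotations(items, field):
    """Return the named annotation field, as strings, for each item."""
    return [str(getattr(i, field)) for i in items]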
The Objective of These Three Parts
In this installment and the two subsequent ones, we will complete an extraction ‘module’ for KBpedia, and organize and package its functions and definitions. We will set up four program files: 1) an __init__.py standard file that begins a package; 2) a __main__.py code file that sets the standard module setup and starting assignments; 3) a config.py file where we set initial parameters for new runs and define our shared dictionaries; and 4) an extract.py set of methods governing our specific KBpedia extraction routines. The first two files are a sort of boilerplate. The third file is where all initialization specifications are entered prior to any new runs. I am hoping to set this project up in such a way that changes need only be made to the config.py file prior to any given run. The fourth file, extract.py, is the meat of the extraction logic and routines and represents the first of multiple clusters of related functionality. As we formulate these clusters, we will also need to look at our overall code and directory organization a few installments from now. For the time being, we will focus on these four starting program files.
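For orientation, here is a minimal sketch of the directory layout we are working toward; the package directory name is illustrative only, not final:

kbpedia/                  # hypothetical package directory name
    __init__.py           # marks the directory as a Python package
    __main__.py           # standard module setup and starting assignments
    config.py             # run parameters and shared dictionaries
    extract.py            # KBpedia extraction methods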
As we discussed in CWPK #18, a module is an individual Python file (*.py) that may set assignments, load resources, define classes, conduct I/O, or define or execute functions. A package in Python is a directory structure that combines one or more Python modules into a coherent library or set of related functions. We are ultimately aiming to produce an entire package of Python functions for extracting, building, testing, or using KBpedia.
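Assuming the hypothetical layout sketched above, consumers of the package could then pull in just the pieces they need:

# hypothetical imports, matching the illustrative package name above
from kbpedia.config import typol_dict, prop_dict
from kbpedia.extract import annot_extractor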
In the first part of this three-part mini-series we will complete a generic method for extracting annotations to file for any of our objects in the KBpedia system. We will be pushing the DRY concept a little harder in this installment. In the second part, we will transition that generalized annotation extraction code from the notebook to a Python package, and extend our general approach to structure extraction. And, in the third part, we will modify the structure extraction to support individual typology files and complete the steps to a complete KBpedia extraction package. It is this baseline package to which we will add further modules as the remaining CWPK series proceeds.
Starting Routine
We again start with our standard opening routine. This set of statements, by the way, will be moved to the __main__.py module, with the file declarations going to the config.py module.
kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
# kbpedia = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kbpedia_reference_concepts.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core'
kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
# kko_file = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kko.owl'

from owlready2 import *
world = World()

kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')
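If you want to confirm these loads succeeded before moving on, a quick check such as this (not part of the standard routine) suffices:

# quick sanity check on the loaded ontology (not part of the routine)
print(len(list(kb.classes())))      # count of classes defined in KBpedia
print(kb.base_iri)                  # confirm the base IRI resolved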
More Initial Configuration
As noted in the objective, we also will codify the starting dictionaries as we defined in CWPK #32. As we begin packaging, these next two dictionary components will be moved to the config.py module.
typol_dict = {
'ActionTypes' : 'kko.ActionTypes',
'AdjunctualAttributes' : 'kko.AdjunctualAttributes',
'Agents' : 'kko.Agents',
'Animals' : 'kko.Animals',
'AreaRegion' : 'kko.AreaRegion',
'Artifacts' : 'kko.Artifacts',
'Associatives' : 'kko.Associatives',
'AtomsElements' : 'kko.AtomsElements',
'AttributeTypes' : 'kko.AttributeTypes',
'AudioInfo' : 'kko.AudioInfo',
'AVInfo' : 'kko.AVInfo',
'BiologicalProcesses' : 'kko.BiologicalProcesses',
'Chemistry' : 'kko.Chemistry',
'Concepts' : 'kko.Concepts',
'ConceptualSystems' : 'kko.ConceptualSystems',
'Constituents' : 'kko.Constituents',
'ContextualAttributes' : 'kko.ContextualAttributes',
'CopulativeRelations' : 'kko.CopulativeRelations',
'Denotatives' : 'kko.Denotatives',
'DirectRelations' : 'kko.DirectRelations',
'Diseases' : 'kko.Diseases',
'Drugs' : 'kko.Drugs',
'EconomicSystems' : 'kko.EconomicSystems',
'EmergentKnowledge' : 'kko.EmergentKnowledge',
'Eukaryotes' : 'kko.Eukaryotes',
'EventTypes' : 'kko.EventTypes',
'Facilities' : 'kko.Facilities',
'FoodDrink' : 'kko.FoodDrink',
'Forms' : 'kko.Forms',
'Generals' : 'kko.Generals',
'Geopolitical' : 'kko.Geopolitical',
'Indexes' : 'kko.Indexes',
'Information' : 'kko.Information',
'InquiryMethods' : 'kko.InquiryMethods',
'IntrinsicAttributes' : 'kko.IntrinsicAttributes',
'KnowledgeDomains' : 'kko.KnowledgeDomains',
'LearningProcesses' : 'kko.LearningProcesses',
'LivingThings' : 'kko.LivingThings',
'LocationPlace' : 'kko.LocationPlace',
'Manifestations' : 'kko.Manifestations',
'MediativeRelations' : 'kko.MediativeRelations',
'Methodeutic' : 'kko.Methodeutic',
'NaturalMatter' : 'kko.NaturalMatter',
'NaturalPhenomena' : 'kko.NaturalPhenomena',
'NaturalSubstances' : 'kko.NaturalSubstances',
'OrganicChemistry' : 'kko.OrganicChemistry',
'OrganicMatter' : 'kko.OrganicMatter',
'Organizations' : 'kko.Organizations',
'Persons' : 'kko.Persons',
'Places' : 'kko.Places',
'Plants' : 'kko.Plants',
'Predications' : 'kko.Predications',
'PrimarySectorProduct' : 'kko.PrimarySectorProduct',
'Products' : 'kko.Products',
'Prokaryotes' : 'kko.Prokaryotes',
'ProtistsFungus' : 'kko.ProtistsFungus',
'RelationTypes' : 'kko.RelationTypes',
'RepresentationTypes' : 'kko.RepresentationTypes',
'SecondarySectorProduct': 'kko.SecondarySectorProduct',
'Shapes' : 'kko.Shapes',
'SituationTypes' : 'kko.SituationTypes',
'SocialSystems' : 'kko.SocialSystems',
'Society' : 'kko.Society',
'SpaceTypes' : 'kko.SpaceTypes',
'StructuredInfo' : 'kko.StructuredInfo',
'Symbolic' : 'kko.Symbolic',
'Systems' : 'kko.Systems',
'TertiarySectorService' : 'kko.TertiarySectorService',
'Times' : 'kko.Times',
'TimeTypes' : 'kko.TimeTypes',
'TopicsCategories' : 'kko.TopicsCategories',
'VisualInfo' : 'kko.VisualInfo',
'WrittenInfo' : 'kko.WrittenInfo'
}
prop_dict = {
'objectProperties' : 'kko.predicateProperties',
'dataProperties' : 'kko.predicateDataProperties',
'representations' : 'kko.representations',
}
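Note that the dictionary values are strings naming the roots, not live objects; the routine below converts them to objects with eval(). A quick inspection shows what the loops will be handed:

print(typol_dict['Animals'])        # -> 'kko.Animals'
print(list(prop_dict.values()))     # the three property roots, as strings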
The Generic Annotation Routine
So, now we come to the heart of the generic annotation extraction routine. For grins as much as anything else, I have wanted to take the DRY perspective and create a generic annotation extractor that could apply to any object or any aggregation of objects within KBpedia. I first tested it with the structure dictionary (typol_dict) and then generalized the arguments and added some additional extractors to handle properties (using prop_dict) as well. The routine as shown below accomplishes our desired extraction objectives.
You can Run this routine, but also change some of the switches to test class versus property extractions as well. To go through the entire set of typologies (typol_dict) takes about 8 minutes to process on a conventional desktop. All other combos, including those for properties, run much quicker.
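If you want to time your own runs, a simple wrapper with Python's standard time module (not part of the routine itself) works fine:

import time

start = time.time()
# . . . place the extraction block below here . . .
print('Elapsed seconds:', time.time() - start)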
I provide line-by-line comments as appropriate to capture the changes needed to generalize this routine. I also add some comments about how we will then break this code block apart in order to conform with the setup and configuration approach. Here is the routine, with the comments detailed below it:
import csv                                                # #1

def render_using_label(entity):                           # #14
    return entity.label.first() or entity.name
set_render_func(render_using_label)

x = 1                                                     # #2
cur_list = []
class_loop = 0
property_loop = 1                                         # #3
loop = property_loop                                      # #15
loop_list = prop_dict.values()                            # #4
print('Beginning annotation extraction . . .')
out_file = 'C:/1-PythonProjects/kbpedia/sandbox/prop_annot_out.csv'   # #15
p_set = ''
with open(out_file, mode='w', encoding='utf8', newline='') as output:
    csv_out = csv.writer(output)                          # #5
    if loop == class_loop:                                # #6, #15
        header = ['id', 'prefLabel', 'subClassOf', 'altLabel', 'definition', 'editorialNote']
    else:
        header = ['id', 'prefLabel', 'subClassOf', 'domain', 'range', 'functional', 'altLabel',
                  'definition', 'editorialNote']
    csv_out.writerow(header)
    for value in loop_list:                               # #7
        print('   . . . processing', value)
        root = eval(value)                                # #8
        p_set = root.descendants()                        # #9, #15
        if root == kko.representations:                   # #10
            p_set.remove(backwardCompatibleWith)
            p_set.remove(deprecated)
            p_set.remove(incompatibleWith)
            p_set.remove(priorVersion)
            p_set.remove(versionInfo)
            p_set.remove(isDefinedBy)
            p_set.remove(label)
            p_set.remove(seeAlso)
        for p_item in p_set:
            if p_item not in cur_list:                    # #11
                a_pref = p_item.prefLabel
                a_pref = str(a_pref)[1:-1].strip('"\'')   # #12
                a_sup = p_item.is_a
                for a_id, a in enumerate(a_sup):          # #13
                    a_item = str(a)
                    if a_id > 0:
                        a_item = a_sup + '||' + str(a)
                    a_sup = a_item
                if loop == property_loop:                 # #3
                    a_dom = p_item.domain
                    a_dom = str(a_dom)[1:-1]
                    a_rng = p_item.range
                    a_rng = str(a_rng)[1:-1]
                    a_func = ''
                a_item = ''
                a_alt = p_item.altLabel
                for a_id, a in enumerate(a_alt):
                    a_item = str(a)
                    if a_id > 0:
                        a_item = a_alt + '||' + str(a)
                    a_alt = a_item
                a_alt = a_item
                a_def = p_item.definition
                a_def = str(a_def).strip('[]')
                a_note = p_item.editorialNote
                a_note = str(a_note)[1:-1]
                if loop == class_loop:                    # #6
                    row_out = (p_item,a_pref,a_sup,a_alt,a_def,a_note)
                else:
                    row_out = (p_item,a_pref,a_sup,a_dom,a_rng,a_func,a_alt,a_def,a_note)
                csv_out.writerow(row_out)                 # #1
                cur_list.append(p_item)
                x = x + 1
print('Total rows written to file:', x)                   # #16
Beginning annotation extraction . . .
. . . processing kko.predicateProperties
. . . processing kko.predicateDataProperties
. . . processing kko.representations
Total rows written to file: 4843
Here are some of the specific changes to the routine above, keyed by number, to accommodate our current generic and DRY needs versus the first prototype presented in the earlier CWPK #30:
1. We need to import the csv module at this point to make sure we can format longer text (definitions, especially) with the proper escaping of delimiting characters such as commas, etc.
2. We’re putting some temporary counters in to keep track of the number of items we process
3. Our generic annotation extraction method allows us to specify whether we are processing classes or properties
4. Our big, or outer, loop cycles over the entries in our starting dictionary. Each one of these is a root with a set of child elements
5. Here is where we switch out the writer to enable proper escaping of large text strings, etc., for CSV
6. We check whether it is classes or properties we are looping over, and switch the number of output columns accordingly. The next code enables us to put a single-row header in our CSV files to label the output fields
7. We take the big chunks of the combined roots in our starting dictionaries
8. And we convert them to strings for easier later manipulation (also see the prior installment for cautions about the eval() method)
9. The heart of this routine is to grab all of the descendant sub-items from our starting root
10. This is a temporary kludge because possible namespace or assignment errors require us to trap these annotations from our standard set; these properties are all part of the starting core KKO ontology ‘stub’
11. Since there are many duplicates across our groupings, this check ensures we are only adding new assignments to our results. It effectively is a duplicate-removal routine
12. We need to make some one-off string changes in order for our actual output to conform to an expected CSV file
13. As discussed in prior CWPK installments, some record fields allow for more than one entry. This general routine loops over those sub-set members, making the format changes and commitments as indicated (a more compact equivalent is sketched just after this list)
14. This part of the code block will be moved to the setup.py module, since how we want to render our extractions will be shared across modules
15. Will move all of these items to the config.py module
16. A little feedback for grins.
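Regarding comment #13, here is a more compact equivalent of that ‘||’ accumulation pattern; the sample list merely stands in for a multi-member field such as p_item.altLabel:

# compact equivalent of the '||' accumulation in the routine above
labels = ['fox', 'red fox', 'Vulpes vulpes']
a_alt = '||'.join(str(a) for a in labels)
print(a_alt)                        # fox||red fox||Vulpes vulpes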
If you inspect the code base, for example, you will see that many of the parts above have been broken out into different files.
BTW, if you want to see the members of the outer loop set, you can do so with this code snippet (set your own root):
root = kko.representations
p_set = root.descendants()
print(p_set)
length = len(p_set)
print(length)
Based on the changes described in the comment notes, embedding this generic annotation routine into its own method, annot_extractor, will end up with this deployed code structure:
__main__.py material
config.py material
def annot_extractor(arg1, arg2)
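As a hedged sketch only (the argument names are placeholders, with the real signature coming in the next installment), the wrapper might look like:

# illustrative only; argument names are placeholders, not the final API
def annot_extractor(loop, out_file):
    """Extract annotations for either classes or properties,
    writing one CSV row per new item to out_file."""
    loop_list = prop_dict.values() if loop == property_loop \
                else typol_dict.values()
    # . . . remainder as in the code block above, with the hard-wired
    # loop and out_file assignments replaced by these arguments . . .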
We’re now ready to migrate this notebook code to a formal Python package and to extend the method to the structure extractor, the topics of our next installment.
Additional Documentation
Style guidelines and coding standards should be near at hand whenever you are writing code. That is because code is meant to be shared and understood, and conventions and lessons regarding readability are a key part of that. Here are some references useful for whatever work you choose to do with Python:
- Python Style Guide (PEP 8)
- Python ‘docstring’ Conventions (PEP 257)
- DEV’s Python Project Structure and Imports (great!)
- The Best of the Best Practices (BOBP) Guide for Python
- The Hitchhiker’s Guide to Structuring Your Project.