More Fields, But Less Complexity
We now tackle the ingest of annotations for classes and properties in this installment of the Cooking with Python and KBpedia series. In prior installments we built the structural aspects of KBpedia. We now add the labels, definitions, and other assignments to them.
As with the extraction routines, we will split these efforts into class annotations and then property annotations. Our actual load routines are fairly straightforward, and we have no real logic concerns in how these annotations get added. The most complex wrinkle we will need to address is those annotation fields, altLabels and notes in particular, where we have potentially many assignments for a single reference concept (RC) or property. As we saw with the extraction routines, for these items we will need to set up additional internal loops to segregate and assign the items for loading based on our standard double-pipe (‘||’) delimiter.
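As a minimal sketch of the double-pipe handling described above (the helper name split_multi is hypothetical and not part of cowpoke; the whitespace stripping is also an added assumption), the splitting step might look like this:

```python
def split_multi(field, delimiter='||'):
    """Split a multi-valued annotation field into its individual items.

    Hypothetical helper for illustration only. Empty entries are dropped,
    so an empty field yields an empty list.
    """
    return [item.strip() for item in field.split(delimiter) if item.strip()]


# A field holding three alternative labels
alt_labels = split_multi('mammal||Mammalia||mammalian')

# An empty field: plain ''.split('||') would return [''], which is why
# the build routines in this installment test against [''] before looping
no_labels = split_multi('')
raw_empty = ''.split('||')
```

The routines in this installment use a bare split followed by a comparison with [''] rather than a helper; either approach keeps empty strings from being appended as annotations.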
The two functions we develop in this installment, class_annot_build and prop_annot_build, will be added to the build.py module.
Start-up
Since we are in an active part of the build cycle, we want to continue with our main knowledge graph in-progress for our load routine, so please make sure that kb_src is set to ‘standard’ in your config.py configuration. We then invoke our standard start-up:
from cowpoke.__main__ import *
from cowpoke.config import *
Loading Class Annotations
Class annotations potentially consist of the item’s prefLabel, altLabels, definition, and editorialNote. The first item is mandatory; the next two should be provided to adhere to best practices; the last is optional. There are, of course, other standard annotations possible. Should your own conventions require or encourage them, you will likely need to modify the procedure below to account for that fact.
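To make the mandatory versus recommended distinction concrete, here is a small sketch of a row check. The function check_class_row and the two tuples are hypothetical names introduced for illustration; the sample row is invented and the column names follow those used in the build routine.

```python
MANDATORY = ('prefLabel',)                 # must be present
RECOMMENDED = ('altLabel', 'definition')   # best practice, but not required


def check_class_row(row):
    """Return (ok, warnings) for one annotation row held as a dict.

    ok is False when a mandatory field is empty; warnings lists the
    recommended fields that are empty. editorialNote is optional and
    is never reported.
    """
    missing = [f for f in MANDATORY if not (row.get(f) or '').strip()]
    warnings = [f for f in RECOMMENDED if not (row.get(f) or '').strip()]
    return (not missing, warnings)


# Invented sample row for illustration
row = {'id': 'Mammal', 'prefLabel': 'mammal', 'altLabel': '',
       'definition': 'A warm-blooded vertebrate.', 'editorialNote': ''}
ok, warnings = check_class_row(row)
```

A check of this kind could be run over the input file before the build, though the routine below loads whatever the file contains.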
As with these methods before, we provide a header showing ‘typical’ configuration settings (in config.py), and then proceed with a method that loops through all of the rows in the input file. Here is the basic class annotation build procedure. There are no new wrinkles in this routine from what has been seen previously:
### KEY CONFIG SETTINGS (see build_deck in config.py) ###
# 'kb_src' : 'standard'
# 'loop_list' : file_dict.values(), # see 'in_file'
# 'loop' : 'class_loop',
# 'in_file' : 'C:/1-PythonProjects/kbpedia/v300/build_ins/classes/Generals_annot_out.csv',
# 'out_file' : 'C:/1-PythonProjects/kbpedia/v300/target/ontologies/kbpedia_reference_concepts_test.csv',
def class_annot_build(**build_deck):
    print('Beginning KBpedia class annotation build . . .')
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    class_loop = build_deck.get('class_loop')
#    r_id = ''
#    r_pref = ''
#    r_def = ''
#    r_alt = ''
#    r_note = ''
    if loop != 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval)
        in_file = loopval
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            # The column list is truncated here; it must name at least
            # 'id', 'prefLabel', 'altLabel', 'definition' and 'editorialNote'
            reader = csv.DictReader(input, delimiter=',', fieldnames=[C])
            for row in reader:
                if is_first_row:                 # skip the header line before any lookup
                    is_first_row = False
                    continue
                r_id_frag = row['id']
                id = getattr(rc, r_id_frag)
                if id is None:                   # no such RC in the graph; report and move on
                    print(r_id_frag)
                    continue
                r_pref = row['prefLabel']
                r_alt = row['altLabel']
                r_def = row['definition']
                r_note = row['editorialNote']
                id.prefLabel.append(r_pref)
                id.definition.append(r_def)
                i_alt = r_alt.split('||')        # multiple altLabels share one field
                if i_alt != ['']:
                    for item in i_alt:
                        id.altLabel.append(item)
                i_note = r_note.split('||')      # as do multiple notes
                if i_note != ['']:
                    for item in i_note:
                        id.editorialNote.append(item)
    print('KBpedia class annotation build is complete.')
class_annot_build(**build_deck)
kb.save(file=r'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts_test.owl', format='rdfxml')
BTW, when we commit this method to our build.py module, we will add the save routine at the end.
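For readers who want to see the row-reading pattern in isolation, here is a minimal, self-contained sketch that runs without a knowledge graph. The column order in FIELDS, the sample data, and the function read_rows are all assumptions made for illustration; they are not taken from the KBpedia input files.

```python
import csv
import io

# Assumed column order, for illustration only
FIELDS = ['id', 'prefLabel', 'altLabel', 'definition', 'editorialNote']

# In-memory stand-in for an annotation input file
sample = io.StringIO(
    'id,prefLabel,altLabel,definition,editorialNote\n'
    'Mammal,mammal,mammalian||Mammalia,A warm-blooded vertebrate.,\n'
)


def read_rows(handle):
    """Read annotation rows, skipping the header line.

    Because fieldnames is passed explicitly, csv.DictReader treats the
    header line as ordinary data, so it has to be skipped by hand.
    """
    reader = csv.DictReader(handle, delimiter=',', fieldnames=FIELDS)
    rows = []
    for index, row in enumerate(reader):
        if index == 0:
            continue
        rows.append(row)
    return rows


rows = read_rows(sample)
first_alt_labels = rows[0]['altLabel'].split('||')
```

This is the reason the build routine carries an is_first_row flag: with explicit fieldnames, the header would otherwise be looked up as if it were a reference concept.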
Loading Property Annotations
We now turn our attention to annotations of properties:
### KEY CONFIG SETTINGS (see build_deck in config.py) ###
# 'kb_src' : 'standard'
# 'loop_list' : prop_dict.values(), # see 'in_file'
# 'loop' : 'property_loop',
# 'in_file' : 'C:/1-PythonProjects/kbpedia/v300/build_ins/properties/prop_annot_out.csv',
# 'out_file' : 'C:/1-PythonProjects/kbpedia/v300/target/ontologies/kbpedia_reference_concepts_test.csv',
def prop_annot_build(**build_deck):
    print('Beginning KBpedia property annotation build . . .')
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    out_file = build_deck.get('out_file')
    if loop != 'property_loop':
        print("Needs to be a 'property_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval)
        in_file = loopval
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'prefLabel', 'subPropertyOf', 'domain',
                                   'range', 'functional', 'altLabel', 'definition', 'editorialNote'])
            for row in reader:
                if is_first_row:                 # skip the header line before any lookup
                    is_first_row = False
                    continue
                r_id = row['id']
                r_pref = row['prefLabel']
                r_dom = row['domain']
                r_rng = row['range']
                r_alt = row['altLabel']
                r_def = row['definition']
                r_note = row['editorialNote']
                r_id = r_id.replace('rc.', '')
                id = getattr(rc, r_id)
                if id is None:                   # no such property in the graph; report and move on
                    print(r_id)
                    continue
                id.prefLabel.append(r_pref)
                i_dom = r_dom.split('||')
                if i_dom != ['']:
                    for item in i_dom:
                        id.domain.append(item)
                if 'owl.' in r_rng:
                    r_rng = r_rng.replace('owl.', '')
                    r_rng = getattr(owl, r_rng)
                    id.range.append(r_rng)
                else:
                    # Empty or non-OWL ranges are not assigned yet
                    pass
#                    id.range.append(r_rng)
                i_alt = r_alt.split('||')
                if i_alt != ['']:
                    for item in i_alt:
                        id.altLabel.append(item)
                id.definition.append(r_def)
                i_note = r_note.split('||')
                if i_note != ['']:
                    for item in i_note:
                        id.editorialNote.append(item)
    print('KBpedia property annotation build is complete.')
prop_annot_build(**build_deck)
Hmmm. One of the things we notice in this routine is that our domain and range assignments have not been adequately picked up in our earlier KBpedia version 2.50 build routines (the ones undertaken in Clojure before this CWPK series). As a result, we cannot adequately test range and will need to address this oversight before our series is over.
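Since range cannot yet be tested against real data, the following is only a sketch of the lookup idea used in the routine: turning a prefixed string such as 'owl.Thing' into an object by attribute access on a namespace. The owl_stub object and the function resolve_range are hypothetical stand-ins so the example runs without owlready2; they are not part of cowpoke.

```python
from types import SimpleNamespace

# Stand-in for the real 'owl' namespace object (assumption for illustration)
owl_stub = SimpleNamespace(Thing='owl:Thing')


def resolve_range(value, namespaces):
    """Resolve a 'prefix.Name' string against a dict of namespace objects.

    Returns None when the value is empty, has no prefix, names an unknown
    prefix, or names something the namespace does not hold.
    """
    prefix, sep, name = value.partition('.')
    if not sep or prefix not in namespaces:
        return None
    return getattr(namespaces[prefix], name, None)


known = resolve_range('owl.Thing', {'owl': owl_stub})
unknown_prefix = resolve_range('xsd.string', {'owl': owl_stub})
empty = resolve_range('', {'owl': owl_stub})
```

Returning None for anything unresolved lets the caller skip the assignment for that row while still loading its labels and definition.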
As before, we will add our ‘save’ routine as well when we commit the method to the build.py module.
kb.save(file=r'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts_test.owl', format='rdfxml')
We now have all of the building blocks to create our extract-build roundtrip. We summarize the formal steps and configuration settings in CWPK #47. But, first, we need to return to cleaning our input files and instituting some unit tests.
I am a bit confused about having loop_list be referring to file_dict.values() because in config.py file_dict only has ‘wikipedia-categories’ as is shown in https://github.com/Cognonto/cowpoke/blob/master/config.py
Should we have in_file not be the items in loop_list for both of these methods? It seems like file_dict won’t refer to these. I guess what may have been intended is to have file_dict refer to the in_file in both methods.
Hi Varun,
Another good catch. I think the proper value here is prop_dict.values(). I must have not updated the setting after an earlier test. I will update the Binder files tomorrow when I post the next installment. I have changed the blog posting here.
Thanks, Mike
I got a bit busy during this past month and just began going through the CWPK series again. Trying to read this again, and I couldn’t understand what file_dict.values() should be for class_annot_build.
Moreover, I was a bit confused why in both functions we use in_file = loopval here. The reason I’m confused is that we want to have ‘in_file’ as part of the build_deck dictionary. So according to the code in here (and in cowpoke), we’d want to open the file which is loopval (in the case for prop_annot_build), so this means that we are trying to open ‘kko.predicateProperties’, ‘kko.predicateDataProperties’, ‘kko.representations’. I thought that we’d want in_file to be C:/1-PythonProjects/kbpedia/v300/build_ins/other_stuff.
Sorry for asking about this again, I’m just a bit confused.
Nevermind, I understood after reading CWPK 47.
Hi Varun,
Good; I’m glad you figured it out. However, I would not be surprised if some of the header instruction information is wrong or lists unused variables. (Perhaps a leftover from cut-and-paste.) If you care to cite the specific CWPK and routine, I will look at it to make sure the instructions are accurate.
Thanks!
Well I was looking at CWPK 47 and was looking at both prop2_annot_build and class2_annot_build. The loop_list for both of these are supposed to be the one item in in_file, correct?