Moving from Notebook to Package Proved Perplexing
This installment of the Cooking with Python and KBpedia series is the second of a three-part mini-series on writing and packaging a formal Python project. The previous installment described a DRY (don’t repeat yourself) approach to how to generalize our annotation extraction routine. This installment describes how to transition that code from Jupyter Notebook interactive code to a formally organized Python package. We also extend our generalized approach to the structure extractor.
In this installment I am working with the notebook and the Spyder IDE in tandem. The notebook is the source of the initial prototype code. It is also the testbed for seeing if the package may be imported and is working properly. We use Spyder for all of the final code development, including moving into functions and classes and organizing by files. We also start to learn some of its IDE features, such as auto-complete, which is a nice way to test questions about namespaces and local and global variables.
As noted in earlier installments, a Python ‘module’ is a single script file (in the form of my_file.py) that itself may contain multiple functions, variable declarations, class (object) definitions, and the like, kept in this single file because of their related functionality. A ‘package’ in Python is a directory with at least one module and (generally) a standard __init__.py file that informs Python a package is available and its name. Python packages and modules are named with lower case. A package name is best when short and without underscores. A module may use underscores to better convey its purpose, such as do_something.py.
For our project based on Cooking with Python and KBpedia (CWPK), we will pick up on this acronym and name our project ‘cowpoke’. The functional module we are starting the project with is extract.py, the module for the extraction routines we have been developing over the past few installments.
Perplexing Questions
While it is true the Python organization has some thorough tutorials, referenced in the concluding Additional Documentation, I found it surprisingly difficult to figure out how to move my Jupyter Notebook prototypes to a packaged Python program. I could see that logical modules (single Python scripts, *.py) made sense, and that there were going to be shared functions across those modules. I could also see that I wanted to use a standard set of variable descriptions in order to specify ‘record-like’ inputs to the routines. My hope was to segregate all of the input information required for a new major exercise of cowpoke into the editing of a single file. That would make configuring a new run a simple process.
I read and tried many tutorials trying to figure out an architecture and design for this packaging. I found the tutorials helpful at a broad, structural level of what goes into a package and how to refer and import other parts, but the nuances of where and how to use classes and functions and how to best share some variables and specifications across modules remained opaque to me. Here are some of the questions and answers I needed to discover before I could make progress:
1. Where do I put the files to be seen by the notebook and the project?
After installing Python and setting up the environment noted in installments CWPK #9 – #11 you should have many packages already on your system, including for Spyder and Jupyter Notebook. There are at least two listings of full packages in different locations. To re-discover what your Python paths are, run this cell:
import sys
print(sys.path)
You want to find the site-packages directory under your Python library (mine is C:\1-PythonProjects\Python\lib\site-packages). We will define the ‘cowpoke’ directory under this parent and also point our Spyder project to it. (NB: Of course, you can locate your package directory anywhere you want, but you would need to add that location to your path as well, and later configuration steps may also require customization.)
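As a minimal sketch of that discovery step, the standard library can report the site-packages location directly, rather than picking it out of the full sys.path listing. The extra directory below is a made-up placeholder, not a path from this series.

```python
import sys
import sysconfig

# Directory where pure-Python third-party packages are installed
purelib = sysconfig.get_paths()['purelib']
print('site-packages:', purelib)

# Entries of sys.path that look like site-packages directories
candidates = [p for p in sys.path if p.endswith('site-packages')]
print(candidates)

# If the package lives elsewhere, that location must be on the path.
# 'C:/my-packages' is a hypothetical location used only for illustration.
extra_dir = 'C:/my-packages'
if extra_dir not in sys.path:
    sys.path.append(extra_dir)
```

Note that appending to sys.path this way only lasts for the current Python session.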
2. What is the role of class and defined variables?
I know the major functions I have been prototyping, such as the annotation extractor from the last CWPK #33 installment, need to be formalized as a defined function (the def function_name statement). Going into this packaging, however, it is not clear to me whether I should package multiple function definitions under one class (some tutorials seem to so suggest) or where and how I need to declare variables such as loop that are part of a run configuration.
One advantage of putting both variables and functions under a single class is that they can be handled as a unit. On the other hand, having a separate class of only input variables seems to be the best arrangement for a record orientation (see next question #4). In practice, I chose to embrace both types.
3. What is the role of self and where to introduce or use it?
The question of the role of self perplexed me for some time. On the one hand, self is not a reserved keyword in Python, but it is used frequently by convention. Class variables come in two flavors. One flavor is when the variable value is universal to all instances of the class. Every instance of this class will share the same value for this variable. It is declared simply after first defining the class and outside of any methods:
variable = my_variable
In contrast, instance variables, which is where self is used, are variables with values specific to each instance of the class. The values of one instance typically vary from the values of another instance. Instance variables should be declared within a method, often with this kind of form, as this example from the Additional Documentation shows:
class SomeClass:
    variable_1 = "This is a class variable"
    variable_2 = 100                      # this is also a class variable

    def __init__(self, param1, param2):
        self.instance_var1 = param1       # instance_var1 is an instance variable
        self.instance_var2 = param2       # instance_var2 is an instance variable
In this recipe, we are assigning self by convention to the first parameter of the function (method). We can then access the values of the instance variable as declared in the definition via the self convention, also without the need to pass additional arguments or parameters, making for simpler use and declarations. (NB: You may name this first parameter something other than self, but that is likely confusing since it goes against the convention.)
Importantly, we use this same convention of self as the first parameter for all instance methods, not only when setting instance variables. When a method is called on an instance, Python automatically passes that instance as the first argument (self), ahead of any arguments supplied in the call.
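The distinction between the two kinds of variables, and the automatic passing of self, can be illustrated with a small sketch. The class and attribute names here are hypothetical and are not part of cowpoke.

```python
class Extractor:
    out_dir = 'sandbox'            # class variable: shared by every instance

    def __init__(self, name):
        self.name = name           # instance variable: specific to each instance

    def describe(self):            # instance method: 'self' is the calling instance
        return self.name + ' -> ' + self.out_dir


annot = Extractor('annotations')
struct = Extractor('structure')

print(annot.describe())
print(struct.describe())

# Calling through the instance is shorthand for passing it explicitly
print(annot.describe() == Extractor.describe(annot))

# Rebinding the class variable is seen by all instances
Extractor.out_dir = 'v300'
print(annot.out_dir, struct.out_dir)
```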
At any rate, for our interest of being able to pass variable assignments from a separate config.py file to a local extraction routine, the approach using the universal class variable is the right form. But, is it the best form?
4. What is the best practice for initializing a record?
If one stands back and thinks about what we are trying to do with our annotation extraction routine (as with other build or extraction steps), we see that we are trying to set a number of key parameters for what data we use and what branches we take during the routine. These parameters are, in effect, keywords used in the routines, the specific values of which (sources of data, what to loop over, etc.) vary by the specific instance of the extraction or build run we are currently invoking. This set-up sounds very much like a kind of ‘record’ format where we have certain method fields (such as output file or source of the looping data) that vary by run. This is equivalent to a key:value pair. In other words, we can treat our configuration specification as the input to a given run of the annotation extractor as a dictionary (dict) as we discussed in the last installment. The dict form looks to be the best form for our objective. We’ll see this use below.
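To make the ‘record’ idea concrete, here is a minimal sketch of a run specification held in a dict and read back by key. The dictionary name, keys, and file names are illustrative assumptions only, not the actual cowpoke configuration.

```python
# Hypothetical run specification; keys and values are illustrative only
run_deck = {
    'loop': 'property_loop',
    'loop_list': ['kko.predicateProperties', 'kko.representations'],
    'out_file': 'prop_annot_out.csv',
}

def summarize(deck):
    """Read the run settings by key, with a fallback for a missing one."""
    loop = deck.get('loop')
    out_file = deck.get('out_file')
    kb_src = deck.get('kb_src', 'standard')   # default if the key is absent
    return f'{loop}: {len(deck["loop_list"])} roots -> {out_file} ({kb_src})'

print(summarize(run_deck))

# A new run only needs changed values, not changed code
class_deck = dict(run_deck, loop='class_loop', out_file='class_out.csv')
print(summarize(class_deck))
```

The point of the design is that the routine stays fixed while each run is described by editing values on the right-hand side of the record.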
5. What are the special privileges of __main__.py?
Another thing I saw while reading the background tutorials was reference to a more-or-less standard __main__.py file. However, in looking at many of the packages installed in my current Python installation I saw that this construct is by no means universally used, though some packages do use it. Should I be using this format or not?
For two reasons my general desire is to remove this file. The first reason is because this file can be confused with the __main__ module. The second reason is because I could find no real clear guidance about best practices for the file except to keep it simple. That seemed to me thin gruel for keeping something I did not fully understand and found confusing. So, I initially decided not to use this form.
However, I found things broke when I tried to remove it. I assume with greater knowledge or more experience I might find the compelling recipe for simplifying this file away. But, it is easier to keep it and move on rather than get stuck on a question not central to our project.
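For reference, the documented role of __main__.py is to act as the entry point when a package is run with python -m. The sketch below demonstrates only that behavior, using a throwaway package with the made-up name demo_pkg; it does not explain why removing the file broke things in cowpoke.

```python
import os
import subprocess
import sys
import tempfile

# Build a throwaway package named 'demo_pkg' (a made-up name) in a temp directory
with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = os.path.join(tmp, 'demo_pkg')
    os.mkdir(pkg_dir)
    with open(os.path.join(pkg_dir, '__init__.py'), 'w') as f:
        f.write('')
    with open(os.path.join(pkg_dir, '__main__.py'), 'w') as f:
        f.write("print('running as', __name__)\n")

    # 'python -m demo_pkg' executes demo_pkg/__main__.py
    result = subprocess.run(
        [sys.executable, '-m', 'demo_pkg'],
        cwd=tmp, capture_output=True, text=True,
    )

output = result.stdout.strip()
print(output)
```

When the same file is imported as an ordinary module instead (for example, from another file in the package), its __name__ is the dotted module name rather than '__main__'.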
6. What is the best practice for arranging internal imports across a project?
I think one of the reasons I did not see a simple answer to the above question is the fact I have not yet fully understood the relationships between global and local variables and module functions and inheritance, all of which require a sort of grokking, I suppose, of namespaces.
I plan to continue to return to these questions as I learn more with subsequent installments and code development. If I encounter new insights or better ways to do things, my current intent is to return to any prior installments, leave the existing text as is, and then add annotations as to what I learned. If you have not seen any of these notices by now, I guess I have not later discovered better approaches. (Note: I think I began to get a better understanding about namespaces on the return leg of our build ’roundtrip’, roughly about CWPK #40 from now, but I still have questions, even from that later vantage point.)
New File Definitions
As one may imagine, the transition from notebook to module package has resulted in some changes to the code. The first change, of course, was to split the code into the starting pieces, including adding the __init__.py that signals the available cowpoke package. Here is the new file structure:
|-- PythonProject
    |-- Python
        |-- [Anaconda3 distribution]
        |-- Lib
            |-- site-packages                    # location to store files
                |-- alot
                |-- cowpoke                      # new project directory
                    |-- __init__.py              # four new files here
                    |-- __main__.py
                    |-- config.py
                    |-- extract.py
                |-- TBA
                |-- TBA
At the top of each file we place our import statements, including references to other modules within the cowpoke project. Here is the statement at the top of __init__.py (which also includes some package identification boilerplate):
from cowpoke.__main__ import *
from cowpoke.config import *
from cowpoke.extract import *
I should note that the asterisk (*) character above tells the system to import all public objects within the file, a practice that is generally not encouraged, though it is common. It is discouraged because of the number of objects brought into the current working space, which may pose name conflicts or burden the system for larger projects. However, since our system is quite small and I do not foresee unmanageable namespace complexity, I use this simpler shorthand.
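One standard way to limit what a star import brings in is for a module to define an __all__ list. The sketch below shows that mechanism using an in-memory module with the hypothetical name demo_config; cowpoke as described here does not necessarily define __all__.

```python
import sys
import types

# Build a small in-memory module named 'demo_config' (hypothetical) for the demo
demo = types.ModuleType('demo_config')
exec(
    "__all__ = ['extract_deck']\n"
    "extract_deck = {'loop': 'property_loop'}\n"
    "scratch = 'not exported'\n",
    demo.__dict__,
)
sys.modules['demo_config'] = demo

# Run a star import into a fresh namespace and see which names arrive
namespace = {}
exec('from demo_config import *', namespace)

imported = sorted(k for k in namespace if not k.startswith('__'))
print(imported)
```

Only the name listed in __all__ reaches the importing namespace; scratch stays behind in the module.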
Our __main__.py contains the standard start-up script that we have recently been using for many installments. You can see this code and the entire file by running the next cell (assuming you have been following this entire CWPK series and have stored earlier distribution files):
with open(r'C:\1-PythonProjects\Python\Lib\site-packages\cowpoke\__main__.py', 'r') as f:
    print(f.read())
(NB: Remember the ‘r’ prefix on the file name string is to treat the string as ‘raw’, so that the backslashes are not read as escape characters.)
We move our dictionary definitions to config.py. Go ahead and inspect it in the next cell, but realize that much has been added to this file due to subsequent coding steps in our project installments:
with open(r'C:\1-PythonProjects\Python\Lib\site-packages\cowpoke\config.py', 'r') as f:
    print(f.read())
We already had the class and property dictionaries as presented in the CWPK #33 installment. The key change notable for config.py, which remember is intended for where we enter run specifications for a new run (build or extract) of the code, was to pull out our specifications for the annotation extractor. This new dictionary, the extract_deck, is expanded later to embrace other run parameters for additional functions. At the time of this initial set-up, however, the dictionary contained these relatively few entries:
# This is the dictionary for the specifications of each
# extraction run; what is its run deck.
# (Kept as comments: a string literal placed inside the braces
# would be joined to the first key.)
extract_deck = {
    'property_loop' : '',
    'class_loop'    : '',
    'loop'          : 'property_loop',
    'loop_list'     : prop_dict.values(),
    'out_file'      : 'C:/1-PythonProjects/kbpedia/sandbox/prop_annot_out.csv',
    }
These are the values passed to the new annotation extraction function, def annot_extractor, now migrated to the extract.py module. Here is the commented code block (which will not run on its own as a cell):
def annot_extractor(**extract_deck):                     # define the method here, see note
    print('Beginning annotation extraction . . .')
    loop_list = extract_deck.get('loop_list')            # notice we are passing run_deck to current vars
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    a_dom = ''
    a_rng = ''
    a_func = ''
    # These are internal counters used in this module's methods
    p_set = ''
    x = 1
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)
        ...                                              # remainder of code as prior installment . . .
Note: Normally, a function definition is followed by its arguments in parentheses. The special notation of the double asterisks (**) signals to expect a variable list of keywords (more often in tutorials shown as ‘**kwargs’), which is how we make the connection to the values of the keys in the extract_deck dictionary. We retrieve these values based on the .get() method shown in the next assignments. Note, as well, that positional arguments can also be treated in a similar way using the single asterisk (*) notation (‘*args’).
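As a small illustration of the two notations, the sketch below uses a hypothetical function, show_run, that simply returns whatever positional and keyword arguments it receives.

```python
def show_run(*args, **kwargs):
    """Hypothetical function: collect positional and keyword arguments."""
    return args, kwargs

deck = {'loop': 'class_loop', 'out_file': 'out.csv'}

# ** unpacks the dictionary into keyword arguments: loop=..., out_file=...
args, kwargs = show_run(**deck)
print(args, kwargs)

# * unpacks a sequence into positional arguments
args, kwargs = show_run(*['first', 'second'], loop='property_loop')
print(args, kwargs)
```

For the ** form to work in a call, the dictionary keys must be strings, since each key becomes a keyword name.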
At the command line or in an interactive notebook, we can run this function with the following call:
import cowpoke
cowpoke.annot_extractor(**cowpoke.extract_deck)
We are not calling it here given that your local config.py is not set up with the proper configuration parameters for this specific example.
These efforts complete our initial set-up on the Python cowpoke package.
Generalizing and Moving the Structure Extractor
You may want to relate the modified code in this section to the last state of our structure extraction routine, shown as the last code cell in CWPK #32.
We took that code, applied the generalization approaches discussed earlier, and added a set.union method to get the unique list from a very large list of large sets. This approach using sets (which can be hashed) sped up what had been a linear lookup by about 10x. We also moved the general parameters to share the same extract_deck dictionary.
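The difference between the two approaches to finding unique items can be sketched with toy data. The set contents below are hypothetical stand-ins for the descendants of several root classes, and no timing is claimed for this small example.

```python
# Hypothetical stand-ins for the descendant sets of several root classes
descendant_sets = [
    {'Animal', 'Mammal', 'Dog'},
    {'Mammal', 'Dog', 'Cat'},
    {'Plant', 'Tree'},
]

# List-based approach: each 'not in' test scans the list (linear lookup)
unique_list = []
for group in descendant_sets:
    for item in group:
        if item not in unique_list:
            unique_list.append(item)

# Set-based approach: hashing makes membership checks roughly constant time
unique_set = set()
for group in descendant_sets:
    unique_set = unique_set.union(group)

print(sorted(unique_set))
```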
We made the same accommodations for processing properties versus classes (and typologies). We wrapped the resulting code block into a defined function wrapper, similar to what we did for annotations, only now for (is-a) structure:
from owlready2 import *
from cowpoke.config import *
from cowpoke.__main__ import *
import csv
import types

world = World()

kko = []
kb = []
rc = []
core = []
skos = []
kb_src = master_deck.get('kb_src')                         # we get the build setting from config.py

if kb_src is None:
    kb_src = 'standard'
if kb_src == 'sandbox':
    kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
elif kb_src == 'standard':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
elif kb_src == 'extract':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/build_ins/ontologies/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/ontologies/kko.owl'
elif kb_src == 'full':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kbpedia_rc_stub.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
else:
    print('You have entered an inaccurate source parameter for the build.')
skos_file = 'http://www.w3.org/2004/02/skos/core'

kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')

def struct_extractor(**extract_deck):
    print('Beginning structure extraction . . .')
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    x = 1
    cur_list = []
    a_set = []
    s_set = []
#    r_default = ''                                        # Series of variables needed later
#    r_label = ''                                          #
#    r_iri = ''                                            #
#    render = ''                                           #
    new_class = 'owl:Thing'
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)
        if loop == class_loop:
            header = ['id', 'subClassOf', 'parent']
            p_item = 'rdfs:subClassOf'
        else:
            header = ['id', 'subPropertyOf', 'parent']
            p_item = 'rdfs:subPropertyOf'
        csv_out.writerow(header)
        for value in loop_list:
            print('   . . . processing', value)
            root = eval(value)
            a_set = root.descendants()
            a_set = set(a_set)
            s_set = a_set.union(s_set)
        print('   . . . processing consolidated set.')
        for s_item in s_set:
            o_set = s_item.is_a
            for o_item in o_set:
                row_out = (s_item,p_item,o_item)
                csv_out.writerow(row_out)
                if loop == class_loop:
                    if s_item not in cur_list:
                        row_out = (s_item,p_item,new_class)
                        csv_out.writerow(row_out)
                cur_list.append(s_item)
                x = x + 1
    print('Total rows written to file:', x)

struct_extractor(**extract_deck)
Beginning structure extraction . . .
. . . processing kko.predicateProperties
. . . processing kko.predicateDataProperties
. . . processing kko.representations
. . . processing consolidated set.
Total rows written to file: 9670
Again, since we cannot guarantee your operating circumstances, you can try this on your own instance with the command:
cowpoke.struct_extractor(**cowpoke.extract_deck)
Note we’re using a prefixed cowpoke function to make the generic dictionary request. All we need to do before the run is to go to the config.py file, and make the value (right-hand side) changes to the extract_deck dictionary. Save the file, make sure your current notebook instance has been cleared, and enter the command above.
There aren’t any commercial-grade checks here to make sure you are not inadvertently overwriting a desired file. Loose code and routines such as what we are developing in this CWPK series warrant making frequent backups, and scrutinizing your config.py assignments before kicking off a run.
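As one possible safeguard, and only a sketch rather than anything in cowpoke itself, a small helper could copy an existing output file aside before a run overwrites it. The function name backup_if_exists and the ‘.bak’ suffix are assumptions made for this illustration.

```python
import os
import shutil
import tempfile

def backup_if_exists(out_file):
    """Hypothetical helper: copy an existing output file to '<name>.bak'."""
    if os.path.exists(out_file):
        backup = out_file + '.bak'
        shutil.copy2(out_file, backup)
        return backup
    return None

# Demonstration in a temporary directory rather than a real project path
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'prop_annot_out.csv')
    first = backup_if_exists(target)          # nothing to back up yet
    with open(target, 'w', encoding='utf8') as f:
        f.write('id,subPropertyOf,parent\n')
    second = backup_if_exists(target)         # existing file is copied
    backup_made = second is not None and os.path.exists(second)

print(first, backup_made)
```

Such a helper could be called with the out_file value from the run dictionary just before the file is opened for writing.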
Additional Documentation
Here are additional guides resulting from the research in today’s installment:
- Python’s Class and Instance Variable documentation
- Understanding self in Python
- PythonTips’ The self variable in python explained
- DEV’s class v instance variables
- Programiz’ self in Python, Demystified
- StackOverflow’s What is __main__.py?
- See StackOverflow for a nice example of the advantage of using sets to find unique items in a listing.
You don’t need to manage the site-packages directory. Instead, consider the “python setup.py develop” command to support in-progress packages and sys.path support.
Thanks, Jon. Do you have a link or further info to describe this option?
Mike