Out of Sequence, But Reducing ‘Garbage’ Always Makes Sense
We have noted in previous installments in this Cooking with Python and KBpedia series how important consistent UTF-8 encoding is to roundtripping with our files. One of the ways we can enforce this importance is to consistently read and write files with UTF-8 specified, as discussed in CWPK #31. But, what if we have obtained external information? How can we ensure it is in the proper encoding or has wrong character assignments fixed? If we are going to perform such checks, what other consistency tests might we want to include? In this installment, we add some pre-build routines to test and clean our files for proper ingest.
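As a quick reminder of the basic pattern from CWPK #31 (the file names here are only placeholders), reading and writing with the encoding stated explicitly looks like this:

# state the encoding explicitly on both read and write; file names are placeholders
with open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\example.csv', 'r', encoding='utf8') as f_in:
    text = f_in.read()

with open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\example_copy.csv', 'w', encoding='utf8') as f_out:
    f_out.write(text)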
As I noted in CWPK #39, cleaning comes before the build steps in the actual build process. But we wanted to have an understanding of broader information flows throughout the build or use scenarios before formulating the cleaning routines. That is both because they are not always operationally applied, and because working out the build steps was aided by not having to carry around extra routines. Now that we have the ingest and build steps fairly well outlined, it is an easier matter to see where and how cleaning steps best fit into this flow.
At the outset, we know we want to work with clean files when building KBpedia. Do we want such checks to run in every build, or optionally? Do we want to run checks against single files or against entire directories or projects? Further, are we not likely to want to add more checks over time as our experience with the build process and problems encountered increase? Lastly, we can see down the road (CWPK #48) to where we also only want to make incremental changes to an existing knowledge graph, as opposed to building one from scratch or de novo. How might that affect cleaning requirements or placement of methods?
Design Considerations
In thinking about these questions, we decided to take this general approach to testing and vetting clean files:
- Once vetted, files will remain clean (insofar as the tests run) until next edited. It may not make sense to check all files automatically at the beginning of a build. This point suggests we should have a separate set of cleaning routines from the overall build process. We may later want to fold cleaning into a complete build routine, but we can do so later as part of a make file approach rather than including cleaning as a mandatory part of all builds.
- Once we have assembled our files for a new build, we should assume that all files are unvetted. As build iterations proceed, we only need to vet those files that have been modified. When initially testing a new build, it probably makes sense for us to be able to loop over all of the input files in a given directory (corresponding to most of the subdirectories under kbpedia > version > build; see the prior CWPK #37 installment). These points suggest we want the option to configure our clean routines for either all files in a subdirectory or a list of files. To keep configuration complexity lower, we will stipulate that if a list of files is used, they should all be in the same subdirectory.
- Our biggest cleaning concern is that we have clean, UTF-8 text (encodings) in all of our input files. However, if we are going to run this single test, we ought to test for other consistency concerns as well. Here are the additional tests that look useful in our initial module development (a small regex sketch for the last of these checks follows this list):
  - Have new fields (columns) been added to our CSV files?
  - Are our input files missing already defined fields?
  - Are we missing required fields (prefLabel and definition)?
  - Are our fields properly constructed (CamelCase with initial cap for classes, initial lowercase for properties, and URI encoding for IRIs)?
- If we do have encoding issues, and given the manual effort required to fix them, can we include some form of encoding ‘fix’ routine? It turns out there is a Python package for such a routine, which we will test in this installment and include if deemed useful.
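For that last check on identifier construction, something along the following lines is what we have in mind. This is only a minimal sketch using the naming conventions described later in this installment (alphanumerics and underscore only, initial capital for classes, initial lowercase for properties); it is not the final cowpoke code:

import re

# classes: CamelCase with an initial capital; properties: initial lowercase
class_pattern = re.compile(r'^[A-Z][a-zA-Z0-9_]*$')
prop_pattern  = re.compile(r'^[a-z][a-zA-Z0-9_]*$')

print(bool(class_pattern.match('CoteDOr')))       # True
print(bool(prop_pattern.match('prefLabel')))      # True
print(bool(class_pattern.match("Côte-d'Or")))     # False: diacritics and punctuation are not allowed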
These considerations are what have guided the design of the cowpoke clean module. Also, as we noted in CWPK #9, our design is limited to Python 3.x; Python 2 has not been accommodated in cowpoke.
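To give a flavor of the directory-or-file-list choice noted above, a configuration stub might look something like the following. The dictionary name and keys here are hypothetical placeholders for illustration only, not the actual cowpoke settings:

# hypothetical sketch of a cleaning configuration; names and keys are illustrative only
clean_deck = {
    'clean_dir'   : r'C:\1-PythonProjects\kbpedia\v300\builds\working',   # check every file here, or ...
    'clean_files' : ['annotations_orig.csv', 'annotations_excel.csv'],    # ... only these (same subdirectory)
    'fix_encoding': True,                                                 # attempt an encoding repair when problems are found
}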
A Brief Detour for URIs
KBpedia is a knowledge graph based on semantic technologies that incorporates seven major public and online knowledge bases: Wikipedia, Wikidata, DBpedia, schema.org, GeoNames, UNSPSC, and OpenCyc. A common aspect of all of these sources is that information is referenced by a Web string that ‘identifies’ the item at hand and, when clicked, also takes us to the source of that item. In the early days of the Web this identifier mostly pertained to Web pages and was known as a Uniform Resource Locator, or URL. They were the underlined blue links of the Web’s early days.
But there are other protocols for discovering resources on the Internet besides the Web protocols of HTTP and HTTPS. There is Gopher, FTP, email, and others. Also, as information began to proliferate from Web pages to data items within databases and these other sources, the idea of a ‘locator’ was generalized to that of an ‘identifier’ when the reference is a data item and not a page. This generalization is known as a URI or, if a ‘name’ within other schemes or protocols, a URN. Here, for example, is the URI address of the English Wikipedia main page:
https://en.wikipedia.org/wiki/Main_Page
Note that white space is not allowed in this string, and is replaced with underscores in this example.
The allowed characters for constructing one of these addresses were limited to mostly ASCII characters, with some characters like the forward slash (‘/’) reserved because they act as delimiters in constructing an address. If one wanted to include non-allowed characters in a URI address, they needed to be percent encoded. Here, for example, is the English Wikipedia address for its article on the Côte d’Azur Observatory:
https://en.wikipedia.org/wiki/C%C3%B4te_d%27Azur_Observatory
This format is clearly hard to read. Most Web browsers, for example, decode these strings when you look at the address within the browser, so it appears as this:
https://en.wikipedia.org/wiki/Côte_d'Azur_Observatory
And, in fact, if you submit the string exactly as shown above, Wikipedia will accept it as input.
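If you want to move between the percent-encoded and readable forms yourself, Python’s standard urllib.parse module handles both directions. A small illustration (the choice of ‘safe’ characters is an assumption made just for this example):

from urllib.parse import quote, unquote

iri = "https://en.wikipedia.org/wiki/Côte_d'Azur_Observatory"
# percent-encode everything except the scheme and path delimiters
uri = quote(iri, safe=':/')
print(uri)            # https://en.wikipedia.org/wiki/C%C3%B4te_d%27Azur_Observatory
print(unquote(uri))   # back to the readable form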
The Internationalized Resource Identifier (IRI) was proposed and then adopted on the Web as a way of bringing in a wider range of acceptable characters useful to international languages. Mostly what we see in browsers today is the IRI version of these addresses, even if not initially formulated as such.
Sources like Wikipedia and Wikidata restrict their addresses to URIs. A source like DBpedia, on the other hand, supports IRIs. Wikipedia also has a discussion on how to fix these links.
The challenge with these different address formats is that if encoding gets screwed up, IRI versions of addresses can also get screwed up. That might be marginally acceptable when we are encoding something like a definition or comment (an annotation), but it absolutely breaks the data record if it occurs in that record’s identifying address: any change or alteration of the exact characters in the address means we can no longer access that data item.
Non-percent-encoded Wikipedia addresses and DBpedia addresses are two problem areas. We have also tried to limit KBpedia’s identifiers to ASCII versions of these international characters. For example, the KBpedia item for Côte-d’Or shows as the address:
http://kbpedia.org/kko/rc/CoteDOr
We still have a readable label, but one with encoding traps removed.
I provide this detour to highlight that we also need to give special attention in our clean module to how Web addresses are coming into the system and being treated. We obviously want to maintain the original addresses as supplied by the respective external sources. We also want to test and make sure these have not been improperly encoded. And we also want to test that our canonical subset of characters used in KBpedia is being uniformly applied to our own internal addresses.
Encoding Issues and ftfy
Despite it being design point #4 above, let’s first tackle the question of whether encoding fixes may be employed. I move it up the list because it is also the best way to illustrate why encoding issues are at the top of our concerns. First, let’s look at 20 selected records from KBpedia annotations that contain a diversity of language and symbol encodings.
The three files used in this installment are part of the cowpoke distribution; they are also available under https://github.com/Cognonto/CWPK/tree/master/sandbox/builds/working. This first file is the starting set of 20 selected records (remember to press Run or shift+enter to run the cell):
with open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_orig.csv', 'r', encoding='utf8') as f:
    print(f.read())
However, here is that same file when directly imported into Excel and then saved (notice we had to change the encoding to get the file to load in Python):
with open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_excel.csv', 'r', encoding='cp1252') as f:
    print(f.read())
Wow, did that file ever get screwed up! (You will obviously need to change the file locations to match your local configuration.) In fact, there are ways to open CSV files properly in Excel by first firing up the application and then using the File → Open dialogs, but the form above occurs in English MS Excel when you open the file directly, make a couple of changes, and then save. If you do not have a backup, you would be in a world of hurt.
So, how might we fix this file, or can we? The first thing to attempt is to load the file with the Python encoding set to UTF-8. Indeed, in many cases, that is sufficient to restore the proper character displays. One thing that is impressive in the migration to Python 3.6 and later is tremendously more forgiving behavior around UTF-8. That is apparently because of the uniform application now of UTF-8 across Python, plus encoding tests that occur earlier when opening files than occurred with prior versions of Python.
But in instances where this does not work, the next alternative is to use ftfy (fixes text for you). The first thing we need to do is to import the module, which is already part of our conda distribution (see CWPK #9):
import ftfy
Then, we can apply ftfy methods (of which there are many useful ones!) to see if we can resurrect that encoding-corrupted file from Excel:
import io
with io.open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_excel.csv', encoding='utf-8', mode='r', errors='ignore',) as f:
    lines = f.readlines()
print(lines)

fixed_lines = [ftfy.fix_text(line) for line in lines]
print(fixed_lines)     # so you may inspect the results, but we will also write it to file:

with io.open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_fixed.csv', encoding='utf-8', mode='w',) as out:
    print(fixed_lines, file=out)
I have to say this is pretty darn impressive! We have recovered nearly all of the original formats. Now, it is the case there are some stoppers in the file, which is why we needed the more flexible io method of opening the file so that we could ignore the errors. Each of the glitches remaining in the file still needs to be fixed manually. But we can also pass ‘replace’ as a different argument to ‘errors’ to insert known replacement characters that make these glitches quicker to find. Overall, this is a much reduced level of effort compared with fixing the file without ftfy. We have moved from a potentially catastrophic situation to one that is an irritant to fix. That is progress!
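For example, here is a minimal variation of the read above (same file) that marks problem spots rather than dropping them; ‘replace’ substitutes the Unicode replacement character (U+FFFD), which is easy to search for:

import io

# same read as before, but marking undecodable bytes with U+FFFD instead of dropping them
with io.open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_excel.csv',
             encoding='utf-8', mode='r', errors='replace') as f:
    lines = f.readlines()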
Just to confirm (and for which one could do file compares to see specific differences to also help in the manual corrections), here is our now ‘fixed’ output file:
with open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_fixed.csv', 'r', encoding='utf-8') as f:
    print(f.read())
We can also inspect a file as to what encoding we think it has. Again, we use an added package, chardet in this case, to test any suspect file. Here is the general form:
import chardet
with open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_fixed.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
Note that one of the arguments passes the first 10,000 bytes of the file to the method as the basis for estimating the encoding type. Since the routine is quick, there is really no reason to lower this amount, and higher does not seem to provide any better statistics.
Again, a gratifying aspect of the improvements to Python since version 3.6 or so has been a more uniform approach to UTF-8. We also see we have some tools at our disposal, namely ftfy, that can help us dig out of holes that prior encoding mistakes may have dug. In our early years, when encoding mismatches were more frequent, we also developed a Clojure routine for fixing bad characters (or at least converting them to a more readable form). It is likely this routine is no longer needed with Python’s improved handling of UTF-8. However, if this is a problem for your own input files, you can import the unicodedata module from the Python standard library to convert accented (diacritic) characters to ones based on ASCII. Here is the basic form of that procedure:
import unicodedata
def remove_diacrits(input_str):
    input_str = unicodedata.normalize('NFD', input_str).encode('ascii', 'ignore')\
        .decode('utf-8')
    return str(input_str)

s = remove_diacrits("Protégé")
print(s)
Protege
You can embed that routine in a CSV read that also deals with entire rows at a time, similar to some of the other procedures noted here.
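Here is a minimal sketch of such an embedding, using the csv module from the standard library and the remove_diacrits() function just defined (the file names are placeholders):

import csv

# apply remove_diacrits() cell-by-cell across an entire CSV file; file names are placeholders
with open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\input.csv', 'r', encoding='utf-8', newline='') as f_in, \
     open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\output.csv', 'w', encoding='utf-8', newline='') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:
        writer.writerow([remove_diacrits(cell) for cell in row])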
However, the best advice, as we have reiterated, is to make sure that files are written and opened in UTF-8. But it is good to know that if we do encounter encoding issues in the wild, both Python and some of its great packages stand ready to help rectify matters (or at least partially so, with less pain). We have also seen how encoding problems can often be a source of garbage input data.
Flat File Checks
Though Python routines could be written for the next points below, they may be easier to deal with directly in a spreadsheet. This is OK, since we are also at that point in our roundtripping where we are dealing directly with CSV files anyway.
To work directly with the sheet, highlight the file’s entire set of rows and columns that are intended for eventual ingest during a build. Give that block a logical name in the upper-left text box entry directly above the sheet, such as ‘Match’ or ‘Big’. You can continue to invoke that block name to re-highlight your subject block. From there, you can readily sort on the specific input column of interest in order to inspect the entire row of values.
Here is my checklist for such flat file inspection:
- Does any item in the ‘id’ column lack a URI fragment identifier? If so, provide one using the class and property URI naming conventions in KBpedia (CamelCase in both instances, upper initial case for classes, lower initial case for properties, with only alphanumerics and underscore as allowable characters). Before adding a new ‘id’, make sure it is initially specified in one of the class or property struct input files
- Does any item in the ‘prefLabel’ column lack a preferred label? If so, add one; this field is mandatory
- Does any item in the ‘definition’ column lack an entry? If so, add one. Though this field is not mandatory, it is highly encouraged
- Check a few rows. Does any column entry have leading or trailing white spaces? If so, use the spreadsheet TRIM function
- Check a few rows. Do any of the files with a ‘definition’ column show its full text spread over more than one cell? If so, you have an upstream CSV processing issue that is splitting entries at the comma or some other character that should be escaped. The best fix, if intermediate processing has not occurred, is to re-extract the file with correct CSV settings. If not, you may need to concatenate multiple cells in a row to reconstruct the full string
- Check entries for wrong or misspecified namespaces or prefixes. Make sure fragments end with the appropriate characters (‘#’ or ‘/’ if used in a URI construction)
- Check columns where multiple entries may reside using the double-pipe (‘||’) convention, and ensure these decomposable strings are being constructed properly.
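That said, a few of these checks are also simple to script. Here is a rough pandas sketch for the mandatory-field and white-space items, assuming pandas is available in your conda distribution and that the file carries the standard ‘prefLabel’ and ‘definition’ columns (the file name is illustrative):

import pandas as pd

# illustrative file name; adjust to your local build directory
df = pd.read_csv(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_fixed.csv', encoding='utf-8')

print('rows missing prefLabel :', df['prefLabel'].isna().sum())
print('rows missing definition:', df['definition'].isna().sum())

# flag entries with leading or trailing white space
labels = df['prefLabel'].astype(str)
print('untrimmed prefLabels   :', (labels != labels.str.strip()).sum())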
One of the reasons I am resistant to a complete build routine cascading through all of these steps at once is that problems in intermediate processing files propagate through all subsequent steps. That not only screws up much stuff, but it also makes it harder to trace where the problem first arose. This is an instance where I prefer a ‘semi-automatic’ approach, with editorial inspection required between essential build steps.
Other Cleaning Routines
Fortunately, in our case, we are extracting fairly simple CSV files (though often with some long text entries for definitions) and ingesting in basically the same format. As long as we are attentive to how we modify the intermediate flat files, there is not too much further room for error.
However, there are many sources of external data that may eventually warrant incorporation in some manner into your knowledge graph. These external sources may pose a larger set of cleaning and wrangling challenges. Date and time formats, for example, can be particularly challenging.
Hadley Wickham, the noted R programmer and developer of many fine graphics programs, wrote a paper, Tidy Data, that is an excellent starting primer on wrangling flat files. In the case of our KBpedia knowledge graph and its supporting CSV files, about the only guideline he proposes that we consciously violate is by sometimes combining many-to-one data items in a single column (notably for altLabels, but a few others as well). According to Wickham, we should put each individual value on its own row. I have not done so in order to keep the listings more compact and the row count manageable. Nonetheless, his general guidance is excellent. Another useful guide is Wrangling Messy CSV Files by Detecting Row and Type Patterns.
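When a downstream step does need one value per row, expanding a double-pipe cell is straightforward (the altLabel string below is made up for illustration):

# expand a double-pipe delimited cell into individual values
alt_labels = "Cote d'Or||Cote-d'Or||Departement de la Cote-d'Or"
for label in alt_labels.split('||'):
    print(label)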
There are also many additional packages in Python that may assist in dealing with ‘dirty’ input files. Depending on the specific problems you may encounter, some quick Web searches should turn up some useful avenues to pursue.
Lastly, in both our utils.py
and other modules going forward, we will have occasion to develop some bespoke cleaning and formatting routines as our particular topic warrants.
Additional Documentation
Here is some additional documentation related to today’s CWPK installment:
- Wikipedia URLs
- Processing text files with Python 3
- Unicode ‘how to’
- Convert between file encodings
- ftfy module documentation.