Posted:February 22, 2021

New Hierarchy Paper Published

Discusses the Role of Order and Complexity in Knowledge Systems

I was very pleased when the editor of the ISKO Encyclopedia of Knowledge Organization (IEKO), Birger Hjørland, asked me to draft an article on ‘hierarchy’ for peer review. ISKO, the International Society of Knowledge Organization, is the leading international society to advance conceptual work in knowledge organization for all kinds and purposes, such as databases, libraries, dictionaries and the Internet.

Birger had seen my earlier treatment of the hierarchy concept in my book, and felt an expansion of what I had produced there could be of use to this community. I agreed subject to being able to frame the topic using the guidance of Charles Sanders Peirce, to which Birger readily assented. After considerable research and expansion, the draft received many useful comments during review. The resulting paper, with nearly 300 references, has now been published.

Hierarchies — real or artificial — abound to help us organize our world. A hierarchy places items into a general order, where more ‘general’ is also more ‘abstract’. The etymology of the word hierarchy is grounded in notions of religious and social rank. My article, after a broad historical review, focuses on knowledge systems, an interloper of the term hierarchy since at least the 1800s.

Hierarchies in knowledge systems include taxonomies, classification systems, or thesauri in library and information science, and systems for representing information and knowledge to computers, notably ontologies, knowledge graphs, and knowledge representation languages. Hierarchies are the logical underpinning of inference and reasoning in these systems, as well as the scaffolding for classification and inheritance.

Hierarchies in knowledge systems express subsumption relations that have many flexible variants, which we can represent algorithmically, and thus computationally. My article dissects the dimensions of that variability, leading to a proposed typology of hierarchies useful to knowledge systems. The article argues through a perspective informed by Peirce that natural hierarchies are real, can be logically determined, and are the appropriate basis for knowledge systems. Description logics and semantic language standards such as RDF or OWL reflect this perspective, importantly through their open-world logic and vocabularies for generalized subsumption hierarchies.

I conclude the paper by highlighting recent research that suggests possible mechanisms for the emergence of natural hierarchies. These involve the nexus of chance, evolution, entropy, free energy, and information theory.

You may read the open access paper at https://www.isko.org/cyclo/hierarchy.

Posted:November 14, 2017

Hierarchies in Knowledge Representation

Some Basic Use Cases from KBpedia

Download PDF

The human propensity to categorize is based on trying to make sense of the world. The act of categorization is based on how to group things together and how to relate those things and groups to one another. Categorization demands that we characterize or describe the things of the world using what we have termed attributes in order to find similarities [1]. Categorization may also be based on the relationships of things to external things [2]. No matter the method, the results of these categorizations tend to be hierarchical, reflective of what we see in the natural world. We see hierarchies in Nature based on bigger and more complex things being comprised of simpler things, based on fractals or cellular automata, or based on the evolutionary relationships of lifeforms. According to Annila and Kuismanen, “various evolutionary processes naturally emerge with hierarchical organization” [3]. Hierarchy, and its intimate relationship with categorization and categories, is thus fundamental to the why and how we can represent knowledge for computable means.

Depending on context, we can establish hierarchical relationships between types, classes or sets, with instances or individuals, with characteristics of those individuals, and between all of these concepts. There is potentially different terminology depending on context, and the terminology or syntax may also carry formal understanding of how we can process and compute these relationships. Nillson provides a general overview of these kinds of considerations with a useful set of references [4].

Types of Hierarchical Relationships

As early as 1997 Doyle noted in the first comprehensive study of KR languages, “Hierarchy is an important concept. It allows economy of description, economy of storage and manipulation of descriptions, economy of recognition, efficient planning strategies, and modularity in design.” He also noted that “hierarchy forms the backbone in many existing representation languages” [5].

The basic idea of a hierarchy is that some item (‘thing’) is subsidiary to another item. Categorization, expressed both through the categories themselves and the process of how one splits and grows categories, is a constant theme in knowledge representation. The idea of hierarchy is central to what is treated as a category or other such groupings and how those categories or groupings are tied together. A hierarchical relationship is shown diagrammatically in Figure 1 with A or B, the ‘things’, shown as nodes.

Direct Hierarchy

Figure 1: Direct Hierarchy

All this diagram is really saying is that A has some form of superior or superordinate relationship to B (or vice versa, that B is subordinate to A). This is a direct hierarchical relationship, but one of unknown character. Hierarchies can also relate more than two items:

Simple Hierarchy

Figure 2: Simple Hierarchy

In this case, the labels of the items may seem to indicate the hierarchical relationship, but relying on labels is wrong. For example, let’s take this relationship, where our intent is to show the mixed nature of primary and secondary colors [6]:

Multi-hierarchy

Figure 3: Multiple Hierarchy

Yet perhaps our intent was rather to provide a category for all colors to be lumped together, as instances of the concept ‘color’ shows here:

Extensional Hierarchy

Figure 4: Extensional Hierarchy

The point is not to focus on colors – which are, apparently, more complicated to model than first blush – but to understand that hierarchical relations are of many types and what one chooses about a relation carries with it logical implications, the logic determined by the semantics of the representation language chosen and how we represent it. For this clarity we need to explicitly define the nature of the hierarchical relationship. Here are some (vernacular) examples one might encounter:

A	subsumes	B
A	is more basic than	B
A	is a superClassOf	B
A	is more fundamental than	B
A	is broader than	B
A	includes	B
A	is more general	B
B	is-a	A
A	is parent of	B
A	has member	B
A	has an instance of	B
A	has attribute	B
A	has part	B

Table 1: Example Hierarchical Relationships

Again, though we have now labeled the relationships, which in a graph representation are the edges between the nodes, it is still unclear the populations to which these relations may apply and what their exact semantic relationships may be.

Table 2 shows the basic hierarchical relations that one might want to model, and whether the item resides in the universal categories of Charles Sanders Peirce of Firstness, Secondness or Thirdness, introduced in one of my previous articles [7]:

Firstness		Secondness		Thirdness
attribute	―	token (instance)
		sibling
		\|
		sibling
		child	―	parent
		\|
		parent
		token	―	type
		part	―	whole
		\|
		whole
		sub	―	super
				\|
				sub

Table 2: Possible Pairwise (―) Hierarchical Relationships

Note that, depending on context, some of the items may reside in either Secondness or Thirdness (depending on whether the referent is a particular instance or a general). Also note the familial relationships shown: child-parent-grandparent and child-child relationships occur in actual families and as a way of talking about inheritance or relatedness relations. The idea of type or is-a is another prominent one in ontologies and knowledge graphs. Natural classes or kinds, for example, fall into the type-token relationship. Also note that mereological relationships, such as part-whole, may also leave open ambiguities. We also see certain pairs, such a sub-super, child-parent, or part-whole, need context to resolve the universal category relation.

Reliance on item labels alone for the edges and nodes, even for something as seemingly straightforward as color or pairwise relationships, does not give us sufficient information to determine how to evaluate the relationship nor how to properly organize. We thus see in knowledge representation that we need to express our relationships explicitly. Labels are merely assigned names that, alone, do not specify the logic to be applied, what populations are affected, or even the exact nature of the relationship. Without these basics, our knowledge graphs can not be computable. Yet well over 95% of the assignments in contemporary knowledge bases have this item-item character. We need interpretable relationships to describe the things that populate our domains of inquiry so as to categorize that world into bite-sized chunks.

Salthe categorizes hierarchies into two types: compositional hierarchies and subsumption hierarchies [8]. Mereological and part-whole hierarchies are compositional, as are entity-attribute. Subsumption hierarchies are ones of broader than, familial, or evolutionary. Cottam et al. believe hierarchies to be so basically important as to propose a model abstraction over all hierarchical types, including levels of abstraction [9]. These discussions of structure and organization are helpful to understand the epistemological bases underlying various kinds of hierarchy. We should also not neglect recursive hierarchies, such as fractals or cellular automata, which are also simple, repeated structures commonly found in Nature. Fortunately, Peirce’s universal categories provide a powerful and consistent basis for us to characterize these variations. When paired with logic and KR languages and “cutting Nature at its joints” [10], we end up with an expressive grammar for capturing all kinds of internal and external relations to other things.

So far we have learned that most relationships in contemporary knowledge bases are of a noun-noun or noun-adjective nature, which I have loosely lumped together as hierarchical relationships. These relationships span from attributes to instances (individuals) and classes [11] or types, with and between one another. We have further seen that labels either for the subjects (nodes) or for their relationships (edges) are an insufficient basis for computers (or us!) to reason over. We need to ground our relationships in specific semantics and logics in order for them to be unambiguous to reasoning machines.

Structures Arising from Hierarchies

Structure needs to be a tangible part of thinking about a new KR installation, since many analytic choices need to be supported by the knowledge artifact. Different kinds of structure are best for different tools or kinds of analysis. The types of relations chosen for the artifact affects its structural aspects. These structures can be as simple and small as a few members in a list, to the entire knowledge graph fully linked to its internal and external knowledge sources. Here are some of the prominent types of structures that may arise from connectedness and characterization hierarchies:

Lists — unordered members or instances, with or without gaps or duplicates, useful for bulk assignment purposes. Lists generally occur through a direct relation assignment (e.g., rdf:Bag)
Neural networks (graphs) — graph designs based on connections modeled on biological neurons, still in the earliest stages with respect to relations and KR formalisms [12]
Ontologies (graphs) — sometimes ontologies are treated as synonymous with knowledge graphs, but more often as a superset that may allow more control and semantic representation [13] Ontologies are a central design feature of KBpedia [14]
Parts-of-speech — a properly designed ontology has the potential to organize the vocabulary of the KR language itself into corresponding parts-of-speech, which greatly aids natural language processing
Sequences — ordered members or instances, with or without gaps or duplicates, useful for bulk assignment purposes. Sequences generally occur through a direct relation assignment (e.g., rdf:Seq)
Taxonomies (trees)— trees are subsumption hierarchies with single (instances may be assigned to only one class) or multiple (instances may be assigned to multiple classes or types) inheritance. The latter is the common scaffolding for most knowledge graphs
Typologies — are essentially multi-inheritance taxonomies, with the hierarchical organization of types as natural as possible. Natural types (classes or kinds) enable the greatest number of disjoint assertions to be made, leading to efficient processing and modular design. Typologies are a central design feature of KBpedia; see further [15].

Typically KR formalisms and their internal ontologies (taxonomy or graph structures) have a starting node or root, often called ‘thing’, ‘entity’ or the like. Close inspection of the choice of root may offer important insights. ‘Entity’, for example, is not compatible with a Peircean interpretation, since all entities are within Secondness.

KBpedia’s foundational structure is the subsumption hierarchy shown in the KBpedia Knowledge Ontology (KKO) — that is, KBpedia’s upper ontology — and its nodes derived from the universal categories. The terminal, or leaf, nodes in KKO each tie into typologies. All of the typologies are themselves composed of types, which are the hierarchical classification of natural kinds of instances as determined by shared attributes (though not necessarily the same values for those attributes). Most of the types in KBpedia are composed of entities, but attributes and relations also have aggregations of types.

Of course, choice of a KR formalism and what structures it allows must serve many purposes. Knowledge extension and maintenance, record design, querying, reasoning, graph analysis, logic and consistency tests, planning, hypothesis generation, question and answering, and subset selections for external analysis are properly the purview of the KR formalism and its knowledge graph. Yet other tasks such as machine learning, natural language processing, data wrangling, statistical and probabalistic analysis, search indexes, and other data- and algorithm-intensive applications are often best supported by dedicated external applications. The structures to support these kinds of applications, or the ability to export them, must be built into the KR installation, with explicit consideration for the data forms and streams useful to possible third-party applications.

[1] The most common analogous terms to attributes are properties or characteristics; in the OWL language used by KBpedia, attributes are assigned to instances (called individuals) via property (relation) declarations.

[2] The act of categorization may thus involve intrinsic factors or external relationships, with the corresponding logics being either intensional or extensional.

[3] Arto Annila and Esa Kuismanen. 2009. “Natural Hierarchy Emerges from Energy Dispersal”. Biosystems 95, 3: 227–233. https://doi.org/10.1016/j.biosystems.2008.10.008

[4] Jørgen Fischer Nilsson. 2006. “Ontological Constitutions for Classes and Properties”. In Conceptual Structures: Inspiration and Application (Lecture Notes in Computer Science), 35–53.

[5] Jon Doyle. 1977. Hierarchy in Knowledge Representations. MIT Artificial Intelligence Laboratory. Retrieved October 24, 2017 from http://dspace.mit.edu/handle/1721.1/41988

[6] The first and more standard 3-color scheme was first explicated by J W von Goethe (1749-1832). What is actually more commonly used in design is a 4-color scheme from Ewald Hering (1834-1918).

[7] Michael K. Bergman. 2016. “A Foundational Mindset: Firstness, Secondness, Thirdness”. AI3:::Adaptive Information. Retrieved September 18, 2017 from https://www.mkbergman.com/1932/a-foundational-mindset-firstness-secondness-thirdness/

[8] Stanley Salthe. 2012. Hierarchical Structures. https://doi.org/10.1007/s10516-012-9185-0<

[9] Ron Cottam, Willy Ranson, and Roger Vounckx. 2016. “Hierarchy and the Nature of Information”. Information 7, 1: 1. https://doi.org/10.3390/info7010001

[10] Plato. “Phaedrus Dialog (page 265e)”. Perseus Digital Library. Retrieved November 11, 2017 from http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A1999.01.0174%3Atext%3DPhaedrus%3Apage%3D265

[11] In the OWL 2 language used by KBpedia, a class is any arbitrary collection of objects. A class may contain any number of instances (called individuals) or a class may be a subclass of another. Instances and subclasses may belong to none, one or more classes. Both extension and intension may be used to assign instances to classes.

[12] Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. 2017. “A Simple Neural Network Module for Relational Reasoning”. arXiv:1706.01427 [cs]. Retrieved November 1, 2017 from http://arxiv.org/abs/1706.01427

[13] RDF graphs are more akin to the first sense; OWL 2 graphs more to the latter.

[14] In the semantic Web space, “ontology” was the original term because of the interest to capture the nature or being (Greek ὄντως, or ontós) of the knowledge domain at hand. Because the word ‘ontology’ is a bit intimidating, a better variant has proven to be the knowledge graph (because all semantic ontologies take the structural form of a graph).

[15] Michael K. Bergman. 2016. “Rationales for Typology Designs in Knowledge Bases”. AI3:::Adaptive Information. Retrieved September 18, 2017 from https://www.mkbergman.com/1952/rationales-for-typology-designs-in-knowledge-bases/

Posted:September 12, 2016

The Importance of Being Peirce

His Triadic Logic is the Mindset for Categorization

Many of us involved in semantic technologies or information science grapple with the question of categorization. How do we provide a coherent organization of the world that makes sense? Better still, how might we represent this coherent structure in a manner that informs how we can extend or grow our knowledge domains? Most problems of a practical nature require being able to combine information together so as to inform new knowledge. Categories that bring together (generalize) similar things are a key way to aid that.

Embracing semantic technologies means, among standards and other things, that the natural structural representation of domains is the graph. These are formally specified using either RDF or OWL. These ontologies have objects as nodes, and properties between those nodes as edges. I believe in this model, and have worked for at least a decade to promote its use. It is the model used by Google’s knowledge graph, for example.

Knowledge graphs that are upper ontologies typically have 80% to 85% of their nodes acting to group similar objects, mostly what could be axiomatized as ‘classes’ or ‘types’. This realization naturally shifts focus to, then, how are these groups formed? What are the bases to place multiple instances into a given class? Are types the same things as classes?

Knowledge, inherently open and dynamic, can only be used for artificial intelligence when it is represented by structures readable by machines. Digitally readable structures of knowledge and features are essential for machine learning, natural language understanding, or other AI functions. Indeed, were such structures able to be expressed in a mostly automatic way, the costs and efforts to perform AI and natural language processing and understanding functions (NLP and NLU) would be greatly lessened.

Open and dynamic also means that keeping the knowledge base current requires simple principles to educate and train those charged with keeping the structure up to date. Nothing is perfect, humans or AI. Discovery and truth only result from questioning and inspection. The entire knowledge graph is fallible and subject to growth and revision. Human editors — trained and capable — are essential to maintain the integrity of such structures, automation or AI not withstanding. Fundamentally, then, the challenge becomes how to think simply about grouping things and forming categories. Discovery of simplicity is hard without generalization and deep thought.

A Peircean View in Thirdness

Scholars of Charles Sanders Peirce (“purse”) (1839 – 1913) [1] all acknowledge how infused his writings on logic, semiosis, philosophy, and knowledge are with the idea of “threes”. His insights are perhaps most studied with respect to his semiosis of signs, with the triad formed by object, representation, and interpretation. But Peirce recognized many prior philosophers, particularly Kant and Hegel, had also made “threes” a cornerstone of their views. Peirce studied and wrote on what makes “threes” essential and irreducible. His generalization, or abstraction if you will, he called simply the Three Categories, and to reflect their fundamental nature, called each separately as Firstness, Secondness and Thirdness. In his writings over decades, he related or described this trichotomy in dozens of contexts [2].

Across his voluminous writings, which unfortunately are not all available since they are still being transcribed from tens of thousands of original handwritten notes, I glean from the available materials this understanding of his three categories from a knowledge representation standpoint:

Firstness [1ns] — these are potentials, the basic qualities that may combine together or interact in various ways to enable the real things we perceive in the world. They are unexpressed potentialities, the substrate of the actual. These are the unrealized building blocks, or primitives, the essences or attributes or possible juxtapositions; indeed, “these” and “they” are misnomers because, once conceived, the elements of Firstness are no longer Firstness;
Secondness [2ns] — these are the particular realized things or concepts in the world, what we can perceive, point to and describe (including the idea of Firstness, Thirdness, etc.) A particular is also known as an entity, instance or individual;
Thirdness [3ns] — these are the laws, habits, regularities and continuities that may be generalized from particulars. All generals — what are also known as classes, kinds or types — belong to this category. The process of finding and deriving these generalities also leads to new insights or emergent properties, what Peirce called the “surprising fact.”

Understanding, inquiry and knowledge require this irreducible structure; connections, meaning and communication depend on all three components, standing in relation to one another and subject to interpretation by multiple agents (Peirce’s semiosis of signs). Contrast this Peircean view with traditional classification schemes, which have a dyadic or dichotomous nature and do not support such rich views of context and interpretation.

Peirce’s “surprising fact” is new knowledge that emerges from anomalies observed when attempting to generalize or to form habits. Abductive reasoning, a major contribution by Peirce, attempts to probe why the anomaly occurs. The possible hypotheses so formed constitute the Firstness or potentials of a new categorization (identification of particulars and generalization of the phenomena). The scientific method is grounded in this process and reflects the ideal of this approach (what Peirce called the “methodeutic”).

Peirce at a High Altitude

Significant terms we associate with knowledge and its discovery include open, dynamic, process, representation, signification, interpretation, logic, coherence, context, reality, and truth. These were all topics of Peirce’s deep inquiry and explained by him via his triadic world view. For example, Peirce believed in the real as having existence apart from the mind (a refutation of Descartes’ view). He believed there is truth, that it can be increasingly revealed by the scientific method and social consensus (agreement of signs), but current belief as to what is “truth” is fallible and can never be realized in the absolute (it is a limit function). There is always distance and different interpretation between the object, its representation, and its interpretation. But this same logic provides the explanation for the process of categorization, also grounded in Firstness, Secondness and Thirdness [2].

Of course, some Peircean scholars may rightfully see these explanations as a bit of a cartoon, and a possible injustice to his lifetime of work. For more than 100 years philosophers and logicians have tried to plumb Peirce’s insights and writings. This summary by no means captures many subtleties. But, if we ourselves generalize across Peirce’s writings and his application of the Three Categories, we can gain a mindset that, I submit, is both easily grasped and applied, the result of which is a logical, coherent approach to categorization and knowledge representation.

First, we decide the focus of the categorization effort. That may arise from one of three sources. We either are trying to organize a knowledge domain anew; we are splitting an existing category that has become too crowded and difficult to reason over; or we have found a “surprising fact” or are trying to plumb an anomaly. Any of these can trigger the categorization process (and, notice, they are in 1ns, 2ns and 3ns splits). The breadth or scope of the category is based on the domain and the basis of the categorization effort.

How to think about the new category and decide its structure comes from the triad:

Firstness – the potential things, ideas, concepts, entities, forces, factors, events, whatever that potentially bear upon or have relevance to the category; think of it as the universe of thought that might be brought to bear for the new category of inquiry
Secondness – the particular instances, real and imagined, that may populate the information space for the category, including the ideas of attributes and relations, which also need to be part of the Firstness
Thirdness – the generals, types, regularities, patterns, or logical groupings that may arise from combinations of any of these factors. Similarities, “truth” and predictability help inform these groupings.

What constitutes the potentials, realized particulars, and generalizations that may be drawn from a query or investigation is contextual in nature. I outlined more of the categorization process in an earlier article [2].

Peirce’s triadic logic is a powerful mindset for how to think about and organize the things and ideas in our world. Peirce’s triadic logic and views on categorization are fractal in nature. We can apply this triadic logic to any level of information granularity. The graph structure arises from the connections amongst all of these 1ns, 2ns and 3ns factors.

We will be talking further how this 40,000 ft view of the Peircean mindset helps create practical knowledge graphs and ontological structures. We will also be showing an example suitable for knowledge-based artificial intelligence (KBAI). The exciting point is that we have found a simple grounding of three aspects that is logically sound and can be readily trained. We also will be showing how we can do so much more work against this kind of natural KBAI structure.

Stay tuned.

[1] A tremendous starting point for information on Peirce is the category about him on Wikipdia, starting with his eponymous page.

[2] M.K. Bergman, 2016. “A Foundational Mindset: Firstness, Secondness, Thirdness,” AI3:::Adaptive Information blog, March 21, 2016.

Posted:June 27, 2016

Literate Programming for an Open World

Constant Change Makes Adopting New Practices and Learning New Tools Essential

My business partner, Fred Giasson, has an uncanny ability to sense where the puck is heading. Perhaps that is due to his French Canadian heritage and his love of hockey. I suspect rather it is due to his having his finger firmly on the pulse.

When I first met Fred ten years ago he was already a seasoned veteran of the early semantic Web, with new innovative services at the time such as PingTheSemanticWeb under his belt. More recently, as we at Structured Dynamics began broadening from semantic technologies to more general artificial intelligence, Fred did his patient, quiet research and picked Clojure as our new official development language. We had much invested in our prior code base; switching main programming languages is always one of the most serious decisions a software shop can make. The choice has been brilliant, and our productivity has risen substantially. We are also able to exploit fundamentally new potentials based on a functional programming language that runs in the Java VM and has intellectual roots in Lisp.

As our work continues to shift more to knowledge bases and their use for mapping, classification, tagging, learning, etc., our challenges have been of a still different nature. Knowledge bases used in this manner are not only inherently open world because of the changing knowledge base, but because in staging them for machine learners and training sets, test, build and maintenance scripts and steps are constantly changing. Dealing with knowledge management brings substantial technical debt, and systems and procedures must be in place to deal with that. Literate programming is one means to help.

Because of Clojure and its REPL abilities that enable code to be interpreted and run dynamically at time of input, we also have been looking seriously at the notebook paradigm that has come out of interactive science lab books, and now has such exemplar programs such as iPython Notebook (Jupyter), Org-mode, Wolfram Alpha, Zeppelin, Gorilla, and others [1]. Fred had been interested in and poking at literate programming for quite a while, but his testing and use of Org-mode to keep track of our constant tests and revisions led him to take the question more seriously. You can see this example, a more-detailed example, or still another example of literate programming from Fred’s blog.

Literate programming is even a greater change than switching programming languages. To do literate programming right, there needs to be a focused commitment. To do literate programming right, workflows need to change and new tools must be learned. Is the effort worth it?

Literate Programming

Literate programming is a style of writing code and documentation first proposed by Donald Knuth. As any aspect of a coding effort is written — including its tests, configurations, installation, deployment, maintenance or experiments — written narrative and documentation accompanies it, explaining what it is, the logic of it, and what it is doing and how to exercise it. This documentation far exceeds the best-practices of in-line code commenting.

Literate programming narratives might provide background and thinking about what is being tested or tried, design objectives, workflow steps, recipes or whatever. The style and scope of documentation are similar to what might be expected in a scientist’s or inventor’s lab notebook. Indeed, the breed of emerging electronic notebooks, combined with REPL coding approaches, now enable interactive execution of functions and visualization and manipulation of results, including supporting macros.

Systems that support literate programming, such as Org-mode, can “tangle” their documents to extract the code portions for compilation and execution. They can also “weave” their documents to extract all of the documentation formatted for human readability, including using HTML. Some electronic systems can process multiple programming languages and translate functions. Some electronic systems have built-in spreadsheets and graphing libraries, and most open-source systems can be extended (though with varying degrees of difficulty and in different languages). Some of the systems are designed to interact with or publish Web pages.

Code and programs do not reside in isolation. Their operation needs to be explained and understood by others for bug fixing, use or maintenance. If they are processing systems, there are parameters and input data that are the result of much testing and refinement; they may need to be refined further again. Systems must be installed and deployed. Libraries and languages are frequently being updated for security and performance reasons; executables and environments need to be updated as well. When systems are updated, there are tests that need to be run to check for continued expected performance and accuracy. The severity of some updates may require revision to whole portions of the underlying systems. New employees need tech transfer and training and managers need to know how to take responsibility for the systems. These are all needs that literate programming can help support.

One may argue that transaction systems in more stable environments may have a lesser requirement for literate programming. But, in any knowledge-intensive or knowledge management scenario. the inherent open world nature of knowledge makes something like literate programming an imperative. Everything is in constant flux with a positive need for ongoing updates.

The objective of programmers should not be solely to write code, but to write systems that can be used and re-used to meet desired purposes at acceptable cost. Documentation is integral to that objective. Early experiments need to be improved, codified and documented such that they can be re-implemented across time and environment. Any revision of code needs to be accompanied by a revision or update to documentation. A lines-of-code (LOC) mentality is counter-productive to effective software for knowledge purposes. Literate programming is meant to be the workflow most conducive to achieve these ends.

The Nature of Knowledge

For quite some time now I have made the repeated argument that the nature of knowledge and knowledge management functions compel an emphasis on the open world assumption (OWA) [2]. Though it is a granddaddy of knowledge bases, let’s take the English Wikipedia as an example of why literate programming makes sense for knowledge management purposes. Let’s first look at the nature of Wikipedia itself, and then look at (next section) the various ways it must be processed for KBAI (knowledge-based artificial intelligence) purposes.

The nature of knowledge is that it is constantly changing. We learn new things, understand more about existing things, see relations and connections between things, and find knowledge in other arenas that causes us to re-think what we already knew. Such changes are the definition of an “open world” and pose major challenges to keeping information and systems up to date as these changes constantly flow in the background.

We can illustrate these points by looking at the changes in Wikipedia over a 20-month period from October 2012 to June 2014 [3]. Though growth of Wikipedia has been slowing somewhat since its peak growth years, the kinds of changes seen over this period are fairly indicative of the qualitative and structural changes that constantly affect knowledge.

First, the number of articles in the English Wikipedia (the largest, but only one of the 200+ Wikipedia language versions) increased 12% to over 4.6 million articles over the 20-month period, or greater than 0.6% per month. Actual article changes were greater than this amount. Total churn over the period was about 15.3%, with 13.8% totally new articles and 1.5% deleted articles [3].

Second, even greater changes occurred in Wikipedia’s category structure. More than 25% of the categories were net additions over this period. Another 4% were deleted [3]. Fairly significant categorical changes continue because of a concerted effort by the project to address category problems in the system.

And, third, edits of individual articles continued apace. Over this same period, more than 65 million edits were made to existing English articles, or about 0.75 edits per article per month [3]. Many of these changes, such as link or category or infobox assignments, affect the attributes or characteristics of the article subject, which has a direct effect on KBAI efforts. Also, even text changes affect many of the NLP-based methods for analyzing the knowledge base.

Granted, Wikipedia is perhaps an extreme example given its size and prominence. But the kinds of qualitative and substantive changes we see — new information, deletion of old information, adding or changing specifics to existing information, or changing how information is connected and organized — are common occurrences in any knowledge base.

The implications of working with knowledge bases are clear. KBs are constantly in flux. Single-event, static processing is dated as soon as the procedures are run. The only way to manage and use KB information comes from a commitment to constant processing and updates. Further, with each processing event, more is learned about the nature of the underlying information that causes the processing scripts and methods to need tweaking and refinement. Without detailed documentation of what has been done with prior processing and why, it is impossible to know how to best tweak next steps or to avoid dead-ends or mistakes of the past. KBAI processing can not be cost-effective and responsive without a memory. Literate programming, properly done, provides just that.

The Nature of Systems to Manage Knowledge

Of course, KBAI may also involve multiple input sources, all moving at different speeds of change. There are also multiple steps involved in processing and updating the input information, the “systems”, if you will, required to stage and use the information for artificial intelligence purposes. The artifacts associated with these activities range from functional code and code scripts; to parameter, configuration and build files; to the documentation of those files and scripts; to the logic of the systems; to the process and steps followed to achieve desired results; and to the documentation of the tests and alternatives investigated at any stage in the process. The kicker is that all of these components, without a systematic approach, will need updates, and conventional (non-literate) coding approaches will not be remembered easily, causing costly re-discovery and re-work.

We have tallied up at least ten major steps associated with a processing pipeline for KBAI purposes. I briefly describe each below so to better gain a flavor of this overall flux that needs to be captured by literate programming.

1. Updating Changing Knowledge

The section above dealt with this step, which is ensuring that the input knowledge bases to the overall KBAI process are current and accurate. Depending on the nature of the KM system, there may be multiple input KBs involved, each demanding updates. Besides capturing the changes in the base information itself, many of the steps below may also be required to properly process this changing input knowledge.

2. Processing Input KBs

For KBAI purposes, input KBs must be processed so as to be machine readable. Processing is also desirable to expose features for machine learners and to do other clean up of the input sources, such as removal of administrative categories and articles, cleaning up category structures, consolidating similar or duplicative inputs into canonical forms, and the like.

3. Installing, Running and Updating the System

The KBs themselves require host databases or triple stores. Each of the processing steps may have functional code or scripts associated with it. All general management systems need to be installed, kept current, and secured. The management of system infrastructure sometimes requires a staff of its own, let alone install, deploy, monitoring and update systems.

4. Testing and Vetting Placements

New entities and types added to the knowledge base need to be placed into the overall knowledge graph and tested for logical placement and connections. Though final placements should be manually verified, the sheer number of concepts in the system places a premium on semi-automatic tests and placements. Placement metrics are also important to help screen candidates.

5. Testing and Vetting Mappings

One key aspect of KBAI is its use in interoperating with other datasets and knowledge bases. As a result, new or updated concepts in the KB need to be tested and mapped with appropriate mapping predicates to external or supporting KBs. In the case of UMBEL, Structured Dynamics routinely attempts to map all concepts to Wikipedia (DBpedia), Cyc and Wikidata. Any changes to the base KB causes all of these mappings to be re-investigated and confirmed.

6. Testing and Vetting Assertions

Testing does not end with placements and mappings. Concepts are often characterized by attributes and values; they are often given internal assignments such as SuperTypes; and all of these assertions must be tested against what already exists in the KB. Though the tests may individually be fairly straightforward, there are thousands to test and cross-consistency is important. Each of these assertions is subject to unit tests.

7. Ensuring Completeness

As we have noted elsewhere, our KBAI best practices call for each new concept in the KB to be accompanied by a definition, complete characterization and connections, and synonyms or synsets to aid in NLP tasks. These requirements, too, require scripts and systems for completion.

8. Testing and Vetting Coherence

As the broader structure is built and extended, system tests are applied to ensure the overall graph remains coherent and that outliers, orphans and fragments are addressed. Some of this testing is done via component typologies, and some occurs using various network and graph analyses. Possible problems need to be flagged and presented for manual inspection. Like other manual vetting requirements, confidence scoring and ranking of problems and candidates speed up the screening process.

9. Generating Training Sets

A key objective of the KBAI approach is the creating of positive and negative training sets. Candidates need to be generated; they need to be scored and tested; and their final acceptance needs to be vetted. Once vetted, the training sets themselves may need to be expressed in different formats or structures (such as finite state transducers, one of the techniques we often use) in order for them to be performant in actual analysis or use.

10. Testing and Vetting Learners

Machine learners can then be applied to the various features and training sets produced by the system. Each learning application involves the testing of one or more learners; the varying of input feature or training sets; and the testing of various processing thresholds and parameters (including possibly white and black lists). This set of requirements is one of the most intensive on this listing, and definitely requires documentation of test results, alternatives tested, and other observations useful to cost-effective application.

Rinse and Repeat

Each of these 10 steps is not a static event. Rather, given the constant change inherent to knowledge sources, the entire workflow must be repeated on a periodic basis. In order to reduce the tension between updating effort and current accuracy, the greater automation of steps with complete documentation is essential. A lack of automation leads to outdated systems because of the effort and delays in updates. The imperative for automation, then, is a function of the change frequency in the input KBs.

KBAI, perhaps at the pinnacle of knowledge management services, requires more of these steps and perhaps more frequent updates. But any knowledge management activity will incur a portion of these management requirements.

Yes, Literate Programming is Worth It

As I stated in a prior article in this series [4], “The only sane way to tackle knowledge bases at these structural levels is to seek consistent design patterns that are easier to test, maintain and update. Open world systems must embrace repeatable and largely automated workflow processes, plus a commitment to timely updates, to deal with the constant, underlying change in knowledge.” Literate programming is, we have come to believe, one of the integral ways to keep sane.

The effort to adopt literate programming is justified. But, as Fred noted in one of his recent posts, literate programming does impose a cost on teams and requires some cultural and mindset changes. However, in the context of KBAI, these are not simply nice-to-haves, they are imperatives.

Choice of tools and systems thus becomes important in supporting a literate programming environment. As Fred has noted, he has chosen Org-mode for Structured Dynamics’ literate programming efforts. Besides Org-mode, that also (generally) requires adoption by programmers of the Emacs editor. Both of these tools are a bit problematic for non-programmers.

Though no literate programming tools yet support WOPE (write once, publish everywhere), they can make much progress toward that goal. By “weaving” we can create standalone documentation. With the converter tool Pandoc, we can make (mostly) accurate translations of documents in dozens of formats against one another. The system is open and can be extended. Pandoc works best with lightweight markup formats like Org-mode, Markdown, wikitext, Textile, and others.

We’re still working hard on the tooling infrastructure surrounding literate programming. We like the interactive notebooks approach, and also want easy and straightforward ways to deploy code snippets, demos and interactive Web pages.

Because of the factors outlined in this article, we see a renewed emphasis on literate programming. That, combined with the Web and its continued innovations, would appear to point to a future rich in responsive tooling and workflows geared to the knowledge economy.

[1] Other known open source electronic lab notebook options include Beaker Notebook, Flow, nteract, OpenWetWare, Pineapple, Rodeo, RStudio, SageMathCloud, Session, Shiny, Spark Notebook, and Spyder, among others certainly missed. Terminology for these apps includes notebook, electronic lab notebook, data notebook, and data scientist notebook.

[2] See M. K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” AI3:::Adaptive Information blog, December 21, 2009. The open world assumption (OWA) generally asserts that the lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Another way to say it is that everything is permitted until it is prohibited. OWA lends itself to incremental and incomplete approaches to various modeling problems.

OWA is a formal logic assumption that the truth-value of a statement is independent of whether or not it is known by any single observer or agent to be true. OWA is used in knowledge representation to codify the informal notion that in general no single agent or observer has complete knowledge, and therefore cannot make the closed world assumption. The OWA limits the kinds of inference and deductions an agent can make to those that follow from statements that are known to the agent to be true. OWA is useful when we represent knowledge within a system as we discover it, and where we cannot guarantee that we have discovered or will discover complete information. In the OWA, statements about knowledge that are not included in or inferred from the knowledge explicitly recorded in the system may be considered unknown, rather than wrong or false. Semantic Web languages such as OWL make the open world assumption.

Also, you can search on OWA on this blog.

[3] See Ramakrishna B. Bairi, Mark Carman and Ganesh Ramakrishnan, 2015. “On the Evolution of Wikipedia: Dynamics of Categories and Articles,” Wikipedia, a Social Pedia: Research Challenges and Opportunities: Papers from the 2015 ICWSM Workshop; also, see https://stats.wikimedia.org/EN/TablesWikipediaEN.htm

[4] M.K. Bergman, 2016. “Rationales for Typology Designs in Knowledge Bases,” AI3:::Adaptive Information blog, June 6, 2016.

Posted:March 28, 2016

Withstanding the Test of Time

Long-lost Global Warming Paper is Still Pretty Good

My first professional job was being assistant director and then project director for a fifty-year look at the future of coal use by the US Environmental Protection Agency. The effort, called the Coal Technology Assessment (CTA), was started under the Carter Administration in the late 1970s, and then completed after Reagan took office in 1981. That era also spawned the Congressional Office of Technology Assessment. Trying to understand and forecast technological change was a big deal at that time.

We produced many, many reports from the CTA program, some of which were never published because of politics and whether they were at odds or not with official policies of one or the other administration. Nonetheless, we did publish quite a few reports. Perhaps it is the sweetness of memory, but I also recollect we did a pretty good job. Now that more than 35 years have passed, it is possible to see whether we did a good job or not in our half-century forecasts.

The CTA program was the first to publish an official position of EPA on global warming [1], which we also backed up with a more formal academic paper [2]. I have thought much of that paper on occasion over the years, but I did not have a copy myself and only had a memory, but not hard copy, of the paper.

Last week, however, I was contacted by a post-doctoral researcher in Europe trying to track down early findings and recollections of some of the earliest efforts on global climate change. She had a copy of our early paper and was kind enough to send me a copy. I have since been able to find other copies online [2].

In reading over the paper again, I am struck by two things. First, the paper is pretty good, and still captures (IMO) the uncertainty of the science and how to conduct meaningful policy in the face of that uncertainty. And, second, but less positive, is the sense of how little truly has gotten done in the intervening decades. This same sense of déjà vu all over again applies to many of the advanced energy technologies — such as fuel cells, photovoltaics, and passive solar construction — we were touting at that time.

Of course, my own career has moved substantially from energy technologies and policy to a different one of knowledge representation and artificial intelligence. But, it is kind of cool to look back on the passions of youth, and to see that my efforts were not totally silly. It is also kind of depressing to see how little has really changed in nearly four decades.

[1] M.K. Bergman, 1980. “Atmospheric Pollution: Carbon Dioxide,” Environmental Outlook — 1980, Strategic Analysis Group, U.S. Environmental Protection Agency, EPA 600/8 80 003, July 1980, pp. 225-261.

[1] Kan Chen, Richard C. Winter, and Michael K. Bergman, 1980. “Carbon dioxide from fossil fuels: Adapting to uncertainty.” Energy Policy 8, no. 4 (1980): 318-330.

Main Links

Search

Adaptive Information