Knowledge Graphs Deserve Attention in Their Own Right
We first introduced NetworkX in installment CWPK #56 of our Cooking with Python and KBpedia series. The purpose of NetworkX in that installment was to stage data for graph visualizations. In today’s installment, we look at the other side of the NetworkX coin; that is, as a graph analytics capability. We will also discuss NetworkX in relation to staging data for machine learning.
The idea of graphs or networks is at the center of the concept of knowledge graphs. Graphs are unique information artifacts that can be analyzed in their own right as well as being foundations for many unique analytical techniques, including for machine learning and its deep learning subset. Still, graphs as conceptual and mathematical structures are of relatively recent vintage. For example, the field known as graph theory is less than 300 years old. I outlined much of the intellectual history of graphs and their role in analysis in a 2012 article, The Age of the Graph.
Graph or network analysis has three principal aspects. The first aspect is to analyze the nature of the graph itself, with its connections, topologies and paths. The second is to use the structural aspects of the graph representation in order to conduct unique analyses, some of which relate to community detection, influence, or relatedness. The third aspect is to use various or all aspects of the graph representation of the domain to provide, through dimensionality reduction, tractable and feature-rich methods for analyzing or conducting data science work useful to the domain. We briefly cover the first two aspects in this installment; the remaining installments in this Part VI relate more to the third aspect of graph and deep graph representation.
Initial Setup
We will pick up with our NetworkX work from CWPK #56 to start this installment. (See the concluding sections below if you have not already generated the graph_specs.csv file.)
Since I have been away from the code for a bit, I first make sure my Python packages are up to date by running this standard command at the command line:

conda update --all
Then, we invoke our standard start-up routine:
from cowpoke.__main__ import *
from cowpoke.config import *
from owlready2 import *
We then want to bring NetworkX into our workspace, along with pandas for data management. The routine we are going to write will read our earlier graph_specs.csv file using pandas. We will use this specification to create a networkx representation of the KBpedia structure, and then begin reporting on some basic graph stats (which will take a few seconds to run):
import networkx as nx
import pandas as pd
df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')
Graphtype = nx.DiGraph()
G = nx.from_pandas_edgelist(df, edge_attr='weight', create_using=Graphtype)
print('Graph construction complete.')
#
# Print the number of nodes in the graph
print('Number of Nodes:', len(G.nodes()))
#
# Print the edges around the 'Mammal' node
print('Edges:', G.edges('Mammal'))
#
# Get the subgraph of all nodes around node
sub = [ n[1] for n in G.edges('Mammal') ]
# Actually, need to add the 'Mammal' node too
sub.append('Mammal')
#
# Now create a new graph, which is the subgraph
sg = nx.Graph(G.subgraph(sub))
#
# Print the nodes and edges of the new subgraph
print('Subgraph nodes:', sg.nodes())
print('Subgraph edges:', sg.edges())
#
# Print basic graph info
print('Basic graph info:', nx.info(G))
We have picked ‘Mammal’ to generate some subgraphs, and we also call up basic graph info based on networkx. As a directed graph, KBpedia can be characterized by both ‘in degree’ and ‘out degree’. ‘In degree’ is the number of edges pointing to a given node (or vertex); ‘out degree’ is the opposite. The average across all nodes in KBpedia exceeds 1.3. The two averages are necessarily identical, since every edge adds one to the in degree of its target and one to the out degree of its source; our only edge type in this structure is the (transitive) subClassOf.
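To verify these figures directly, here is a quick sketch that averages the in and out degrees over the loaded graph G:

# Average in and out degree across all nodes; the two results must match,
# since every edge adds one to each total
n = G.number_of_nodes()
print('Average in degree: %.2f' % (sum(d for _, d in G.in_degree()) / n))
print('Average out degree: %.2f' % (sum(d for _, d in G.out_degree()) / n))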
Network Metrics and Operations
So we see that our KBpedia graph has loaded properly, and now we are ready to do some basic network analysis. Most of the analysis deals with the relations structure of the graph. NetworkX has a very clean interface to common measures and metrics of graphs, as our examples below demonstrate.
‘Density’ is the ratio of actual edges in the network to all possible edges in the network, and ranges from 0 to 1. A ‘dense’ graph is one where the number of edges is close to this maximum; a ‘sparse’ graph is the opposite. For an undirected graph, the maximal number of edges is the number of node pairs, or n × (n − 1) / 2 for n nodes; for a directed graph that potential doubles to n × (n − 1), since A → B and B → A are both possible. The density is thus the actual number of connections divided by the potential number. By this measure, KBpedia is quite sparse.
print('Density:', nx.density(G))
‘Degree’ is a measure for finding the most important nodes in a graph, since a node’s degree is the sum of its edges. You can find the degree for an individual node, as here:
print('Degree:', nx.degree(G,'Mammal'))
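And a companion sketch (our own formulation) for ranking the highest-degree nodes:

# Rank the ten highest-degree nodes across the graph
top_degree = sorted(G.degree, key=lambda kv: kv[1], reverse=True)[:10]
print('Highest-degree nodes:', top_degree)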
‘Average clustering’ is the mean of the clustering coefficients of all nodes. A node is highly clustered if it has a relatively high number of actual links to neighbors in relation to potential links to neighbors. A small-world network is one where the distance between random nodes grows in proportion to the natural log of the number of nodes in the graph; such networks typically combine high clustering with short average paths.
print('Average clustering:', nx.average_clustering(G))

G_node = 'Mammal'
print('Clustering for node', G_node, ':', nx.clustering(G, G_node))
‘Path length’ is calculated as the number of hop jumps needed to traverse between two end nodes in a network. An ‘average path length’ measures shortest paths over a graph and then averages them. A small number indicates a shorter, more easily navigated graph on average, but there can be much variance.
print('Average shortest path length:', nx.average_shortest_path_length(G))
The next three measures throw an error, since KBpedia ‘is not strongly connected.’ ‘Eccentricity‘ is the maximum length between a node and its connecting nodes in a graph, with the ‘diameter‘ being the maximum eccentricity across all nodes and the ‘radius‘ being the minimum.
print('Eccentricity:', nx.eccentricity(G))
print('Diameter:', nx.diameter(G))
print('Radius:', nx.radius(G))
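These measures do run on a connected graph. Here is a minimal sketch that applies them to the small, connected ‘Mammal’ subgraph sg we built above (they are also quite slow on a graph of KBpedia’s full size):

# Eccentricity measures on the connected 'Mammal' subgraph from above
print('Eccentricity (sg):', nx.eccentricity(sg))
print('Diameter (sg):', nx.diameter(sg))
print('Radius (sg):', nx.radius(sg))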
The algorithms that follow take longer to calculate or produce long listings. The first such measure we see is ‘centrality‘, which in NetworkX is the number of connections to a given node, with higher connectivity a proxy for importance. Centrality can be measured in many different ways; there are multiple options in NetworkX.
# Calculate different centrality measures
print('Centrality:', nx.degree_centrality(G))
print('Centrality (eigenvector):', nx.eigenvector_centrality(G))
print('In-degree centrality:', nx.in_degree_centrality(G))
print('Out-degree centrality:', nx.out_degree_centrality(G))
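Because printing these full dictionaries floods the screen, a small sketch like this one (the variable names are ours) can limit the output to the most central nodes:

# Show only the ten highest-scoring nodes for a centrality measure
cent = nx.degree_centrality(G)
top_cent = sorted(cent.items(), key=lambda kv: kv[1], reverse=True)[:10]
print('Top degree centrality:', top_cent)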
Here are some longer analysis routines (unfortunately, betweenness takes hours to calculate):
# Calculate betweenness centrality
print('Betweenness:', nx.betweenness_centrality(G))
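When the full calculation is impractical, NetworkX can instead estimate betweenness from a random sample of pivot nodes via its k parameter; a hedged sketch (the choices of k=100 and the seed are ours):

# Approximate betweenness centrality from a sample of 100 pivot nodes
approx_bet = nx.betweenness_centrality(G, k=100, seed=42)
top_bet = sorted(approx_bet.items(), key=lambda kv: kv[1], reverse=True)[:10]
print('Top betweenness (approximate):', top_bet)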
Because KBpedia is a directed graph, some NetworkX measures are not applicable to it. Here are two of them:
nx.is_connected(G)
nx.connected_components(G)
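These measures do apply, however, to an undirected view of the same graph; a minimal sketch:

# Undirected view of the KBpedia graph
U = G.to_undirected()
print('Connected?:', nx.is_connected(U))
print('Number of components:', nx.number_connected_components(U))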
Subgraphs
We earlier showed code for extracting a subgraph. Here is a generalized version of that function. Replace the ‘Bird’ reference concept with any other valid RC from KBpedia:
# Provide label for current KBpedia reference concept
rc = 'Bird'
# Get the subgraph of all nodes around node
sub = [ n[1] for n in G.edges(rc) ]
# Actually, need to add the 'rc' node too
sub.append(rc)
#
# Now create a new graph, which is the subgraph
sg = nx.Graph(G.subgraph(sub))
#
# Print the nodes of the new subgraph and edges
print('Subgraph nodes:', sg.nodes())
print('Subgraph edges:', sg.edges())
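To make this pattern reusable, we might wrap the same steps in a small helper function; rc_subgraph is a hypothetical name of ours, not part of the series code:

def rc_subgraph(G, rc):
    # Gather the direct neighbors of the RC, plus the RC itself
    nodes = [n[1] for n in G.edges(rc)]
    nodes.append(rc)
    # Return the neighborhood as a new undirected graph
    return nx.Graph(G.subgraph(nodes))

sg = rc_subgraph(G, 'Bird')
print('Subgraph nodes:', sg.nodes())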
DeepGraphs
There is a notable utility package called DeepGraphs, with its own documentation, that appears to offer some nice partitioning and quick visualization options. I have not installed or tested it.
Full Network Exchange
So far, we have seen the use of networks in driving visualizations (CWPK #56) and, per above, as knowledge artifacts with their own unique characteristics and metrics. The next role we need to highlight for networks is as information providers and graph-based representations of structure and features to analytical applications and machine learners.
NetworkX can convert to and from other data formats: dictionaries of dictionaries, dictionaries of lists, NumPy arrays, SciPy sparse matrices, and pandas DataFrames. All of these are attractive because PyTorch has direct routines for them.
NetworkX can also read and write graphs in multiple formats, some of which include adjacency lists, edge lists, GEXF, GML, GraphML, JSON (node-link), and Pajek.
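As a hedged sketch of what a few of these conversions and serializations look like (the file name is ours; a dense NumPy array is impractical at KBpedia’s size, so we use the sparse form):

edge_df = nx.to_pandas_edgelist(G)        # edge list as a pandas DataFrame
sparse = nx.to_scipy_sparse_array(G)      # sparse adjacency (to_scipy_sparse_matrix in older versions)
nx.write_graphml(G, 'kbpedia.graphml')    # serialize to GraphML . . .
G2 = nx.read_graphml('kbpedia.graphml')   # . . . and read it back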
There are also standard NetworkX functions to convert node and edge labels to integers (such as networkx.relabel.convert_node_labels_to_integers), to relabel nodes (networkx.relabel.relabel_nodes), to set node attributes (networkx.classes.function.set_node_attributes), or to make directed deep copies (networkx.Graph.to_directed).
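A brief sketch of those utilities in use (the 'rc' attribute name and 'split' value are our choices):

# Relabel RCs to integers, keeping each original label as a node attribute
G_int = nx.convert_node_labels_to_integers(G, label_attribute='rc')
print(G_int.nodes[0])
# Relabel selected nodes with an explicit mapping
G_re = nx.relabel_nodes(G, {'Mammal': 'rc.Mammal'})
# Attach a constant attribute to every node
nx.set_node_attributes(G_int, 'unassigned', name='split')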
Certain packages also integrate well with NetworkX, PyTorch, and related packages. The Deep Graph Library (DGL) provides direct imports from and exports to NetworkX (see CWPK #68 and #69), and built-in converters or the DeepSNAP package may provide a direct bridge between NetworkX and PyTorch Geometric (PyG) (see CWPK #68 and #70).
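Hedged sketches of those two bridges, assuming recent dgl and torch_geometric packages are installed (see the noted installments for the full workflows):

import dgl
from torch_geometric.utils import from_networkx

# Integer node labels are the safest common denominator for both bridges
G_int = nx.convert_node_labels_to_integers(G)
g_dgl = dgl.from_networkx(G_int)   # DGL graph from NetworkX
data_pyg = from_networkx(G_int)    # PyG Data object from NetworkX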
However, these representations do NOT include the labeled information or annotations. Knowledge graphs, like KBpedia, have some unique aspects that are not fully captured by an existing package like NetworkX.
Fortunately, the previous extract-and-build routines at the heart of this Cooking with Python and KBpedia series are based around CSV files, the same basis as the pandas package. Via pandas we can capture the structure of KBpedia, plus its labels and annotations. Further, as we will see in the next installment, we can also capture full pages for most of these RCs in KBpedia from Wikipedia. This addition will greatly expand our context and feature basis for using KBpedia for machine learning.
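As a hedged sketch of that pandas-based pairing, the structure and annotation files can be joined on the RC identifier (the exact key formats may need reconciliation, such as prefix stripping):

import pandas as pd

struct_df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')
annot_df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv')
# Left-join annotations onto each structural source RC
merged = struct_df.merge(annot_df, left_on='source', right_on='id', how='left')
print(merged.head())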
For now, I present below two of these three inputs, extracted directly from the KBpedia knowledge graph.
KBpedia Structure
The first of two extraction files useful to all further installments in this Part VI provides the structure of KBpedia. This structure consists of the hierarchical relation between reference concepts using the subClassOf subsumption relation and the assignment of each RC to a typology (SuperType). I first presented this routine in CWPK #56 and it, indeed, captures the requisite structure of the graph:
### KEY CONFIG SETTINGS (see extract_deck in config.py) ###
# 'kb_src'        : 'standard'                                    # Set in master_deck
# 'loop_list'     : kko_order_dict.values(),                      # Note 1
# 'base'          : 'C:/1-PythonProjects/kbpedia/v300/build_ins/mappings/',
# 'ext'           : '.csv',
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv',

def graph_extractor(**extract_deck):
    print('Beginning graph structure extraction . . .')
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    class_loop = extract_deck.get('class_loop')
    base = extract_deck.get('base')
    ext = extract_deck.get('ext')

    # Note 2
    parent_set = ['kko.SocialSystems','kko.Products','kko.Methodeutic','kko.Eukaryotes',
                  'kko.ConceptualSystems','kko.AVInfo','kko.Systems','kko.Places',
                  'kko.OrganicChemistry','kko.MediativeRelations','kko.LivingThings',
                  'kko.Information','kko.CopulativeRelations','kko.Artifacts','kko.Agents',
                  'kko.TimeTypes','kko.Symbolic','kko.SpaceTypes','kko.RepresentationTypes',
                  'kko.RelationTypes','kko.OrganicMatter','kko.NaturalMatter',
                  'kko.AttributeTypes','kko.Predications','kko.Manifestations',
                  'kko.Constituents']

    if loop != 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    header = ['target', 'source', 'weight', 'SuperType']
    out_file = extract_deck.get('out_file')
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)
        csv_out.writerow(header)
        for value in loop_list:
            print('   . . . processing', value)
            s_set = []
            root = eval(value)
            s_set = root.descendants()
            frag = value.replace('kko.','')
            for s_item in s_set:
                child_set = list(s_item.subclasses())
                count = len(list(child_set))

                # Note 3
                if value not in parent_set:
                    for child_item in child_set:
                        s_rc = str(s_item)
                        child = str(child_item)
                        new_pair = s_rc + child
                        new_pair = str(new_pair)
                        cur_list.append(new_pair)
                        s_rc = s_rc.replace('rc.','')
                        child = child.replace('rc.','')
                        row_out = (s_rc,child,count,frag)
                        csv_out.writerow(row_out)
                elif value in parent_set:
                    for child_item in child_set:
                        s_rc = str(s_item)
                        child = str(child_item)
                        new_pair = s_rc + child
                        new_pair = str(new_pair)
                        if new_pair not in cur_list:
                            cur_list.append(new_pair)
                            s_rc = s_rc.replace('rc.','')
                            child = child.replace('rc.','')
                            row_out = (s_rc,child,count,frag)
                            csv_out.writerow(row_out)
                        elif new_pair in cur_list:
                            continue
    output.close()
    print('Processing is complete . . .')

graph_extractor(**extract_deck)
Note, again, the parent_set ordering of typology processing at the top of this function. This ordering processes the more distal (leaf) typologies first, and then ignores subsequent processing of identical structural relationships. This means that the graph structure is cleaner and all subsumption relations are “pushed down” to their most specific mention.
You can inspect the actual structure file produced using this routine, which is also the general basis for reading into various machine learners:
import pandas as pd
df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')
df
KBpedia Annotations
And, we also need to bring in the annotation values. The annotation extraction routine was first presented and described in CWPK #33, and was subsequently generalized and brought into conformance with our configuration routines. Note, for example, in the header definition, how we are able to handle either classes or properties. In this instance, plus all subsequent machine learning discussion, we concentrate on the labels and annotations for classes:
### KEY CONFIG SETTINGS (see extract_deck in config.py) ###
# 'krb_src'       : 'extract'                                     # Set in master_deck
# 'descent_type'  : 'descent',
# 'loop'          : 'class_loop',
# 'loop_list'     : custom_dict.values(),                         # Single 'Generals' specified
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv',
# 'render'        : 'r_label',

def annot_extractor(**extract_deck):
    print('Beginning annotation extraction . . .')
    r_default = ''
    r_label = ''
    r_iri = ''
    render = extract_deck.get('render')
    if render == 'r_default':
        set_render_func(default_render_func)
    elif render == 'r_label':
        set_render_func(render_using_label)
    elif render == 'r_iri':
        set_render_func(render_using_iri)
    else:
        print('You have assigned an incorrect render method--execution stopping.')
        return
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    descent_type = extract_deck.get('descent_type')
    """ These are internal counters used in this module's methods """
    p_set = []
    a_ser = []
    x = 1
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)
        if loop == 'class_loop':
            header = ['id', 'prefLabel', 'subClassOf', 'altLabel',
                      'definition', 'editorialNote', 'isDefinedBy', 'superClassOf']
        else:
            header = ['id', 'prefLabel', 'subPropertyOf', 'domain', 'range',
                      'functional', 'altLabel', 'definition', 'editorialNote']
        csv_out.writerow(header)
        for value in loop_list:
            print('   . . . processing', value)
            root = eval(value)
            if descent_type == 'descent':
                p_set = root.descendants()
            elif descent_type == 'single':
                a_set = root
                p_set.append(a_set)
            else:
                print('You have assigned an incorrect descent method--execution stopping.')
                return
            for p_item in p_set:
                if p_item not in cur_list:
                    a_pref = p_item.prefLabel
                    a_pref = str(a_pref)[1:-1].strip('"\'')
                    a_sub = p_item.is_a
                    for a_id, a in enumerate(a_sub):
                        a_item = str(a)
                        if a_id > 0:
                            a_item = a_sub + '||' + str(a)
                        a_sub = a_item
                    if loop == 'property_loop':
                        a_item = ''
                        a_dom = p_item.domain
                        for a_id, a in enumerate(a_dom):
                            a_item = str(a)
                            if a_id > 0:
                                a_item = a_dom + '||' + str(a)
                            a_dom = a_item
                        a_dom = a_item
                        a_rng = p_item.range
                        a_rng = str(a_rng)[1:-1]
                        a_func = ''
                    a_item = ''
                    a_alt = p_item.altLabel
                    for a_id, a in enumerate(a_alt):
                        a_item = str(a)
                        if a_id > 0:
                            a_item = a_alt + '||' + str(a)
                        a_alt = a_item
                    a_alt = a_item
                    a_def = p_item.definition
                    a_def = str(a_def)[2:-2]
                    a_note = p_item.editorialNote
                    a_note = str(a_note)[1:-1]
                    if loop == 'class_loop':
                        a_isby = p_item.isDefinedBy
                        a_isby = str(a_isby)[2:-2]
                        a_isby = a_isby + '/'
                        a_item = ''
                        a_super = p_item.superClassOf
                        for a_id, a in enumerate(a_super):
                            a_item = str(a)
                            if a_id > 0:
                                a_item = a_super + '||' + str(a)
                            a_super = a_item
                        a_super = a_item
                    if loop == 'class_loop':
                        row_out = (p_item,a_pref,a_sub,a_alt,a_def,a_note,a_isby,a_super)
                    else:
                        row_out = (p_item,a_pref,a_sub,a_dom,a_rng,a_func,
                                   a_alt,a_def,a_note)
                    csv_out.writerow(row_out)
                    cur_list.append(p_item)
                    x = x + 1
    print('Total unique IDs written to file:', x)
    print('The annotation extraction for the', loop, 'is completed.')
You can inspect this actual file of labels and annotations using this routine:
import pandas as pd
df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv')
df
We will add Wikipedia pages as a third source for informing our machine learning tests and experiments in our next installment.
Untested Potentials
One area of extended NetworkX capability that we do not test here is community structure detection using the Louvain Community Detection package.
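For reference, a minimal sketch of what that might look like, assuming the python-louvain package (imported as community) and an undirected view of the graph:

import community as community_louvain

# Louvain partitioning works on undirected graphs
partition = community_louvain.best_partition(G.to_undirected())
print('Number of communities found:', len(set(partition.values())))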
Additional Documentation
Here are additional resources on network analysis and NetworkX:
- Exploring and Analyzing Network Data with Python is a good description of basic measures
- Analysis of Large-Scale Networks: NetworkX provides good examples and explanations
- Graph Algorithms: Graph Analysis and Graph Learning are nice examples and drawings
- Network Analysis with Python and NetworkX Cheat Sheet
- Functions to convert NetworkX graphs to and from other formats
- Knowledge Graph Representation with Jointly Structural and Textual Encoding reflects the structure + text nature of knowledge graphs.