The enterprise Semantic Web, like all Semantic Web instances, by definition depends on semi-structured data. What has generally been lacking in the move toward a semi-structured data paradigm is the creation of adequate processing engines for efficient and scalable storage and retrieval of semi-structured data.[1]
While tremendous effort has gone into data representations like XML, when it comes to positing or designing engines for manipulating that data the usual approach is to bolt kludgy workarounds onto existing relational DBMSs or text search engines. Neither meets the test. Thus, as the Semantic Web and its association with semi-structured data move forward, two impediments stand like gatekeepers blocking progress: the lack of 1) efficient processing engines and 2) scalable systems and architectures.
Unlike the case for structured or unstructured data, there is no accepted database engine specific to semi-structured data. Some systems attempt to use relational DBMS approaches from the structured end of the spectrum; other systems attempt to add some structure to standard unstructured search engines (see the figure in my related posting). Structured data is dominated by RDBMSs, and unstructured data is largely the realm of text or search engines.
Attempts to manage the middle ground of semi-structured data have involved either modifying RDBMS systems to be XML-enabled, adding some structure to existing IR systems, or developing new, native XML data systems from scratch. The native XML systems are relatively new and unproven. For a listing of native XML databases, plus generally useful discussion about the use of XML within databases, see Ron Bourret’s Web site.[2]
Semi-structured data models are sometimes called “self-describing” (or schema-less). These data models are often represented as labeled graphs, or sometimes labeled trees with the data stored at the leaves. The schema information is contained in the edge labels of the graph. Semi-structured representations also lend themselves well to data exchange or the integration of heterogeneous data sources.
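To make the "self-describing" idea concrete, here is a minimal Python sketch of a labeled tree with the data stored at the leaves. The sample records and the helper names (`leaf_paths`, `implied_schema`) are hypothetical and chosen only for illustration; the point is that the schema can be recovered from the edge labels alone, even though the two records have different shapes.

```python
from typing import Any, Iterator, List, Tuple

# Hypothetical sample data: two "person" records with different shapes.
# No schema is declared anywhere; the edge labels carry it.
records = {
    "person": [
        {"name": "Alice", "email": "alice@example.org"},
        {"name": "Bob", "phone": {"home": "555-0100", "work": "555-0199"}},
    ]
}


def leaf_paths(node: Any, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], Any]]:
    """Yield (edge-label path, leaf value) pairs for a labeled tree."""
    if isinstance(node, dict):
        # Each dict key is an edge label leading to a child node.
        for label, child in node.items():
            yield from leaf_paths(child, path + (label,))
    elif isinstance(node, list):
        # Repeated children share the incoming edge label and add none of their own.
        for child in node:
            yield from leaf_paths(child, path)
    else:
        # Anything else is a leaf holding data.
        yield path, node


def implied_schema(node: Any) -> List[str]:
    """Collect the distinct label paths, i.e. the schema implied by the data."""
    return sorted({".".join(path) for path, _ in leaf_paths(node)})


for label_path in implied_schema(records):
    print(label_path)
```

Running the sketch lists the four label paths found in the data (`person.email`, `person.name`, `person.phone.home`, `person.phone.work`). Adding a record with a new field simply adds a new path, which is part of why such representations suit data exchange across heterogeneous sources.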
However, all three of these approaches to managing semi-structured XML data (enabled RDBMSs, modified IR text engines, or native XML data systems) have their own strengths and weaknesses, as shown by the table below:
Type | Pros | Cons |
Because RDBMSs are so prevalent, XML-enabled RDBMSs are perhaps the most common approach, with major commercial vendors such as Oracle, IBM and Sybase offering their own versions. But XML is itself text, much of its information requires text-based retrieval, and open XML schemas that must preserve ordering are very poorly suited to the relational data model. As a result, RDBMS options are very fragile, perform poorly for document-centric retrievals, and lose critical information.
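The ordering problem can be illustrated with a small Python sketch using the standard `sqlite3` and `xml.etree.ElementTree` modules. The XML sample, the one-table-per-element-type layout, and the `pos` column are all assumptions made for illustration, not a description of how any particular vendor's product works. Relational tables are unordered sets of rows, so the interleaving of `step` and `note` elements survives only because an explicit position column is added by hand.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document in which the order of mixed siblings matters.
XML_TEXT = (
    "<recipe>"
    "<step>Boil water</step>"
    "<note>Use a large pot</note>"
    "<step>Add pasta</step>"
    "</recipe>"
)


def shred_and_restore(xml_text: str) -> list:
    """Shred child elements into per-type tables, then rebuild document order."""
    root = ET.fromstring(xml_text)
    conn = sqlite3.connect(":memory:")

    # One table per element type (an assumed shredding layout).
    # `pos` is the extra column that records document order.
    conn.execute("CREATE TABLE step (pos INTEGER, body TEXT)")
    conn.execute("CREATE TABLE note (pos INTEGER, body TEXT)")
    inserts = {
        "step": "INSERT INTO step (pos, body) VALUES (?, ?)",
        "note": "INSERT INTO note (pos, body) VALUES (?, ?)",
    }
    for pos, child in enumerate(root):
        conn.execute(inserts[child.tag], (pos, child.text))

    # Without `pos`, nothing in the two tables says the note sits between
    # the two steps; with it, a UNION plus ORDER BY restores the sequence.
    rows = conn.execute(
        "SELECT pos, 'step' AS tag, body FROM step "
        "UNION ALL "
        "SELECT pos, 'note' AS tag, body FROM note "
        "ORDER BY pos"
    ).fetchall()
    conn.close()
    return rows


restored = shred_and_restore(XML_TEXT)
for row in restored:
    print(row)
```

Even in this tiny case, order has to be modeled, stored, and queried explicitly, and an open schema that admits new element types would require new tables and a rewritten query.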
IR-based text search systems do well on the text retrieval scale, but are not suited at all for storing and managing structured data. Further, many of these systems use in-line tagging of structural attributes. While this approach parses well and can seamlessly work with existing text token indexing, at scale it suffers the fatal flaw of requiring the complete re-indexing of existing content should new attributes or extensions be desired.
Finally, all native XML data systems perform poorly at scale. Some of these native systems build from a text-search basis, others from more object or relational approaches. But, in general, queries and other mechanisms are still highly XML document-centric, with very slow retrievals across large document repositories.
As XML and semi-structured data have become ubiquitous, clearly the path is opening in the marketplace for a “third way.” Later postings will look at efforts by new vendors such as Mark Logic to address this opportunity, as well as emerging efforts from BrightPlanet.
NOTE: This posting is part of an occasional series looking at a new category that I and BrightPlanet are terming the eXtensible Semantic Data Model (XSDM). Topics in this series cover all information related to extensible data models and engines applicable to documents, metadata, attributes, semi-structured data, or the processing, storing, indexing or semantic schemas and mappings of XML, RDF, OWL, or SKOS data. A major white paper will be produced at the conclusion of the series. Stay tuned!
[1] Matteo Magnani and Danilo Montesi, “A Unified Approach to Structured, Semistructured and Unstructured Data,” Technical Report UBLCS-2004-9, Department of Computer Science, University of Bologna, 29 pp., May 29, 2004. See http://www.cs.unibo.it/pub/TR/UBLCS/2004/2004-09.pdf.
[2] See http://www.rpbourret.com/xml/XMLDatabaseProds.htm and http://www.rpbourret.com/xml/XMLAndDatabases.htm.