Mavlyutov et al. have posted a pre-print [1] of a paper, to be presented at ESWC at the end of the month, on the most efficient ways to represent URIs in information systems. Anyone doing large-scale work with the semantic Web or linked data should be interested in these findings.
To my knowledge, the paper is the first to explicitly evaluate common data structures for encoding, storing and retrieving URIs at scale. URIs are the unique identifiers for resources, so millions to billions of them may need to be stored in and retrieved from triple stores or other database backends.
The authors compared a dozen different methods for storing URIs against the standard needs of indexing, inserting and retrieving URIs (including encoding and decoding) at scale. They measured both memory use and operation times. The methods evaluated were specific RDF systems; various hash maps; various hash tables; binary search, B+, ART (adaptive radix) and lexicographic trees; and the HAT-trie.
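To make the encode and decode operations concrete, here is a minimal Python sketch of a URI dictionary that maps each URI to a compact integer ID and back. It is not taken from the paper: the class name `UriDictionary` and its methods are hypothetical, and it uses a plain built-in hash map, only one of the families of structures the authors compared.

```python
class UriDictionary:
    """Toy bidirectional URI <-> integer ID mapping (hypothetical, for illustration)."""

    def __init__(self):
        self._id_by_uri = {}   # hash map: URI string -> integer ID
        self._uri_by_id = []   # position in the list is the ID -> URI string

    def encode(self, uri: str) -> int:
        """Return the ID for a URI, inserting the URI if it is new."""
        existing = self._id_by_uri.get(uri)
        if existing is not None:
            return existing
        new_id = len(self._uri_by_id)   # IDs are assigned densely from 0
        self._id_by_uri[uri] = new_id
        self._uri_by_id.append(uri)
        return new_id

    def decode(self, uri_id: int) -> str:
        """Return the URI stored under an ID."""
        return self._uri_by_id[uri_id]

    def __len__(self) -> int:
        return len(self._uri_by_id)


uris = UriDictionary()
first = uris.encode("http://example.org/resource/a")
second = uris.encode("http://example.org/resource/b")
again = uris.encode("http://example.org/resource/a")  # already stored, same ID
```

A triple store built this way would keep only the integer IDs in its triples and consult the dictionary at the edges, which is why the memory footprint and lookup speed of the dictionary matter so much at scale.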
Different operational needs may point to different methods. However, the authors conclude that “overall, the HAT-trie appears to be a good compromise taking into account all aspects, i.e., memory consumption, loading time, and look-ups. ART also appears as an appealing structure, since it maintains the data in sorted order, which enables additional operations like range scans and prefix lookups, and since it still remains time and memory efficient.”
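The value of keeping URIs in sorted order can be illustrated with a short Python sketch of a prefix lookup. A sorted list searched with the standard `bisect` module stands in here for an ordered structure such as ART; this is an assumption for illustration only, and the function name `prefix_scan` is hypothetical rather than anything from the paper.

```python
from bisect import bisect_left


def prefix_scan(sorted_uris, prefix):
    """Return every URI in a sorted list that starts with the given prefix."""
    # In sorted order all strings sharing a prefix are contiguous, and the
    # first of them sits at the insertion point of the prefix itself.
    index = bisect_left(sorted_uris, prefix)
    matches = []
    while index < len(sorted_uris) and sorted_uris[index].startswith(prefix):
        matches.append(sorted_uris[index])
        index += 1
    return matches


sorted_uris = sorted([
    "http://example.org/person/alice",
    "http://example.org/person/bob",
    "http://example.org/place/paris",
    "http://other.example.com/thing/1",
])
people = prefix_scan(sorted_uris, "http://example.org/person/")
```

A plain hash map cannot answer this kind of query without visiting every key, because hashing discards the ordering that the scan relies on.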
This paper should be a useful reference for any group that needs to manage URIs at scale.