Posted: October 15, 2008

Research Shows Natural Fit between Wikipedia and Semantic Web

SWEETpedia Listing of 163 Research Articles; NZ Technical Report Affirm Trend

An earlier popular entry of this AI3 blog was “99 Wikipedia Sources Aiding the Semantic Web”. Each academic paper or research article in that compilation was based on Wikipedia for semantic Web-related research. Many of you suggested additions to that listing. Thanks!

Wikipedia continues to be an effective and unique source for many information extraction and semantic Web purposes. Recently, I needed to update my own research and found that many valuable new papers have been added to the literature.

I thus decided to make a compilation of such papers a permanent feature — which I’ve named SWEETpedia — and to update it on a periodic basis. You can now find the most recent version under the permanent SWEETpedia page link.

Hint, hint: Check out this link to see the 163 Wikipedia research sources!

NOTE: If you know of a paper that I’ve overlooked, please suggest it as a comment to this posting and I will add it to the next update.

Status of Wikipedia

Meanwhile, a complementary technical report, Mining Meaning from Wikipedia [1], was just released by the University of Waikato in New Zealand. It is a fantastic resource for anyone in this field.

For starters, it summarizes the size and status of the English-version Wikipedia with a more discerning eye than usual:

Categories                               390,000
Articles and related pages             5,460,000
    redirects                          2,970,000
    disambiguation pages                 110,000
    lists and stubs                      620,000
    bona-fide articles                 1,760,000
Templates                                174,000
    infoboxes                              9,000
    other                                165,000
Links
    between articles                  62,000,000
    between category and subcategory     740,000
    between category and article       7,270,000
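
As a quick sanity check on these figures, here is a minimal Python sketch that confirms the sub-counts add up to the stated totals and shows what share of the pages are bona-fide articles. The counts are copied from the report's summary as quoted here; the variable names and the `share` helper are my own, for illustration only.

```python
# Counts copied from the summary of the English Wikipedia quoted in this post.
article_pages = {
    "redirects": 2_970_000,
    "disambiguation pages": 110_000,
    "lists and stubs": 620_000,
    "bona-fide articles": 1_760_000,
}
templates = {"infoboxes": 9_000, "other": 165_000}

# The sub-counts should add up to the stated totals (5,460,000 and 174,000).
total_pages = sum(article_pages.values())
total_templates = sum(templates.values())


def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


bona_fide_share = share(article_pages["bona-fide articles"], total_pages)
redirect_share = share(article_pages["redirects"], total_pages)

print(f"total article pages: {total_pages:,}")
print(f"total templates:     {total_templates:,}")
print(f"bona-fide articles:  {bona_fide_share}% of article pages")
print(f"redirects:           {redirect_share}% of article pages")
```

By this count, roughly a third of the "articles and related pages" are bona-fide articles, while more than half are redirects.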

The size, scope and structure of Wikipedia make it an unprecedented resource for researchers engaged in natural language processing (NLP), information extraction (IE) and semantic Web-related tasks. Further, the more than 250 language versions of Wikipedia also make it a great resource for multi-lingual and translation studies.

Growth of SWEETpedia

In the eight months since posting the semantic Web-related research papers using Wikipedia, my new SWEETpedia listing has grown by about 65%. There are now 63 new papers, bringing the total to 163.

Of course, these are not the only academic papers published about or using Wikipedia. The SWEETpedia listing is specifically related to structure, term, or semantic extractions from Wikipedia. Other research, such as studies of update frequency, collaboration, growth, or comparisons with standard encyclopedias, may also be found under Wikipedia’s own listing of academic studies.

[Figure: Wikipedia Research Papers by Year]

This graph indicates the growth in the use of Wikipedia as a source for semantic Web research. It is hard to tell whether the effort is plateauing; it is too early to draw that conclusion from the apparent slight dip in 2008.

For example, the current SWEETpedia listing has 35% more entries for 2007 than the earlier compilation did. It is likely that many 2008 papers will likewise be discovered later, in 2009. Many of the venues at which these papers get presented can be somewhat obscure, and new researchers keep entering the field.

However, we can conclude that Wikipedia is assuming a role in semantic Web and natural language research never before seen for other frameworks.

Kinds of Semantic Web-related Research

As noted, the new 82-page technical report by Olena Medelyan et al. from the University of Waikato in New Zealand, Mining Meaning from Wikipedia [1], is now the must-have reference for all things related to the use of Wikipedia for semantic Web and natural language research.

Olena and her co-authors, Catherine Legg, David Milne and Ian Witten, have each published much in this field and were some of the earliest researchers tapping into the wealth of Wikipedia.

They first note the many uses to which Wikipedia is now being put:

  • Wikipedia as an encyclopedia — the standard use familiar to the general public
  • Wikipedia as corpus — large text collections for testing and modeling NLP tasks
  • Wikipedia as a thesaurus — equivalent and hierarchical relationships between terms and related or synonymous terms
  • Wikipedia as a database — the extraction and codification of structure and structural relationships
  • Wikipedia as an ontology — the formal expression of relationships in semantic Web and logical constructs, and
  • Wikipedia as a network structure — relationship analysis and mining through Wikipedia’s representation as a network graph.
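
To make the thesaurus and network-structure views a little more concrete, here is a minimal Python sketch. It is my own toy illustration, not a method from the report: it treats redirects as synonym pointers to a canonical article title and treats article links as a directed graph. The titles, links, and function names are all invented for the example and do not come from a real Wikipedia dump.

```python
from collections import defaultdict

# Toy data, invented for illustration only; not taken from a Wikipedia dump.
redirects = {
    "NYC": "New York City",
    "Big Apple": "New York City",
    "USA": "United States",
}
links = [
    ("New York City", "United States"),
    ("United States", "New York City"),
    ("Statue of Liberty", "New York City"),
]


def build_thesaurus(redirects):
    """Group redirect titles under their target article as synonym sets."""
    synonyms = defaultdict(set)
    for alias, target in redirects.items():
        synonyms[target].add(alias)
    return dict(synonyms)


def resolve(term, redirects):
    """Map a term to its canonical article title, or return it unchanged."""
    return redirects.get(term, term)


def inbound_counts(links):
    """Count incoming links per article, a crude measure of prominence."""
    counts = defaultdict(int)
    for _source, target in links:
        counts[target] += 1
    return dict(counts)


print(resolve("NYC", redirects))
print(build_thesaurus(redirects))
print(inbound_counts(links))
```

Real work in this area draws on far more than redirects and raw link counts, but the sketch shows why the same underlying structure can be read as an encyclopedia, a thesaurus, or a graph.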

These types of uses then enable the authors to place various research efforts and papers into context. They do so via four major clusters of relevant tasks related to language processing and the semantic Web:

Natural Language Processing (NLP) Tasks:
  • Semantic relatedness
  • Word sense disambiguation
      • words and phrases
      • named entities
      • thesaurus and ontology terms
  • Co-reference resolution
  • Multilingual alignment

Information Retrieval Tasks:
  • Query expansion
  • Multilingual retrieval
  • Question answering
  • Entity ranking
  • Text categorization
  • Topic indexing

Information Extraction (IE) Tasks:
  • Semantic relations in raw (unstructured) text
  • Semantic relations in structure
  • Typing (classifying) named entities

Ontology Building Tasks:
  • Knowledge organization
  • Named entities
  • Thesaurus information
  • Ontology alignment
  • Facts extraction and assertion

There are many interesting observations throughout this report. There are also useful links to related tools, supporting and annotated datasets, and key researchers in the field.

I highly recommend this report as the essential starting point for anyone first getting into these research topics. Many of the newly added references to the SWEETpedia listing arose from this report. Reading the report is useful grounding to know where to look for specific papers in a given task area.

Though clearly the authors have their own perspectives and research emphases, they do an admirable job of being complete and even-handed in their coverage. Basic review reports such as this play an important role in helping to focus new research and make it productive.

Excellent job, folks! And, thanks!


[1] Olena Medelyan, Catherine Legg, David Milne and Ian H. Witten, 2008. Mining Meaning from Wikipedia, Working Paper Series ISSN 1177-777X, Department of Computer Science, The University of Waikato (New Zealand), September 2008, 82 pp. See http://arxiv.org/ftp/arxiv/papers/0809/0809.4530.pdf.
