Posted: April 11, 2008

Another Deep Web Barrier Falls

As late as 2002, no single search engine indexed the entire surface Web. Much has been written about that time, but the emergence of Google (and indeed others; it was a key battleground at the time) extended full search coverage to the Web, ending the need for so-called desktop metasearchers, then the only option for getting full Web search coverage.

Strangely, even as full document indexing of the surface Web was being conquered, dynamic Web sites and database-backed sites fronted by search forms were emerging. Estimates from about 2001, made by myself and others, suggested such ‘deep Web’ content was many, many times larger than the indexable document Web and was found in literally hundreds of thousands of sites.

Standard Web crawling is a different technique and technology from “probing” the contents of searchable databases, which requires issuing a query to a site’s search form. A company I founded, BrightPlanet, along with many others such as Copernic and Intelliseek (many of which no longer exist), was formed with the specific aim of probing these thousands of valuable content sites.

From those companies’ standpoints, mine at the time included, there was always the threat that the major search engines would draw a bead on deep Web content and use their resources and clout to appropriate this market. Yahoo, for example, struck arrangements with some publishers of deep content to index their content directly, but that still fell short of the different technology that deep Web retrieval requires.

It was always a bit surprising that this rich storehouse of deep Web content was being neglected. In retrospect, perhaps it was understandable: there was still the standard Web document content to index and conquer.

Today, however, Google announced its new deep Web search in a post on one of its developer blogs, Crawling through HTML forms, written by Jayant Madhavan and Alon Halevy, the noted search and semantic Web researcher:

In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn’t find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.

To be sure, there are differences and nuances to retrieval from the deep Web. What is described here is neither truly directed nor comprehensive. But the barrier has fallen. With time, and enough servers, the more inaccessible aspects of the deep Web will fall to the services of major engines such as Google.
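The approach Google describes amounts to enumerating a form’s inputs, picking candidate values, and crawling the resulting query URLs. As a rough illustration only, here is a minimal Python sketch of that idea; the helper names (candidate_urls, FormParser), the sample form, and the word list are my own invention and in no way Google’s actual implementation:

# Illustrative sketch of "form probing": given the HTML of a page with a
# GET-based search form, enumerate a few candidate input combinations and
# emit the URLs a crawler might then fetch. Hypothetical example only.

from html.parser import HTMLParser
from itertools import islice, product
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Collects GET <form> actions and their <input>/<select> fields."""

    def __init__(self):
        super().__init__()
        self.forms = []          # each: {"action": str, "fields": {name: [values]}}
        self._current = None
        self._select_name = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and a.get("method", "get").lower() == "get":
            self._current = {"action": a.get("action", ""), "fields": {}}
            self.forms.append(self._current)
        elif self._current is not None and tag == "input":
            name = a.get("name")
            if name and a.get("type", "text") in ("text", "search", "radio", "checkbox"):
                values = self._current["fields"].setdefault(name, [])
                if a.get("value"):
                    values.append(a["value"])
        elif self._current is not None and tag == "select":
            self._select_name = a.get("name")
        elif self._current is not None and self._select_name and tag == "option" and a.get("value"):
            self._current["fields"].setdefault(self._select_name, []).append(a["value"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None
        elif tag == "select":
            self._select_name = None


def candidate_urls(base_url, html, site_words, limit=10):
    """Yield at most `limit` crawlable GET URLs per form found on the page."""
    parser = FormParser()
    parser.feed(html)
    for form in parser.forms:
        fields = {
            # Text boxes get words drawn from the site itself; enumerated
            # controls (selects, radios, checkboxes) use their own values.
            name: (values or list(site_words))
            for name, values in form["fields"].items()
        }
        names = list(fields)
        combos = product(*(fields[n] for n in names))
        for combo in islice(combos, limit):
            query = urlencode(dict(zip(names, combo)))
            yield urljoin(base_url, form["action"]) + "?" + query


if __name__ == "__main__":
    sample = """
    <form action="/search" method="get">
      <input type="text" name="q">
      <select name="category">
        <option value="books">Books</option>
        <option value="maps">Maps</option>
      </select>
    </form>
    """
    for url in candidate_urls("http://example.com/", sample, ["geology", "hydrology"]):
        print(url)

Run against the sample form, the sketch prints four URLs, one per combination of a site-derived word and a menu value. The hard parts (choosing good probe words, and judging which result pages are valid, interesting, and new) are exactly where the differences and nuances noted above come in.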

And, this is a good thing for all consumers desiring full access to the Web of documents.

So, an era is coming to a close. And this, too, is appropriate. For we are also now transitioning into the complementary era of the Web of data.
