SAN JOSE, Calif. — You think the Web is big? In truth, it’s far bigger than it appears.
The Web is made up of hundreds of billions of Web documents, far more than the 8 billion to 20 billion claimed by Google or Yahoo. But most of these pages are unreachable by most search engines because they are stored in databases that cannot be accessed by Web crawlers.
Now a San Mateo, Calif., start-up called Glenbrook Networks says it has devised a way to tunnel far into the “deep Web” and extract this previously inaccessible information.
Glenbrook, run by a father-daughter team, demonstrated its technology by building a search engine that scoops up job listings from the databases of various Web sites, something the company claims most search engines cannot do. But there are myriad other applications as well, the founders say.
“Most of the information out there, people want you to see,” said Julia Komissarchik, Glenbrook Networks’ vice president of products. “But it’s not designed to be accessed by a machine like a search engine. It requires human intervention.”
This is particularly true of Web pages that are stored in databases. Many ordinary Web pages are static files that exist permanently on a server somewhere. But an untold number of pages do not exist until the very moment an individual fills out a form on a Web site and asks for the information. Online dictionaries, travel sites, library catalogs and medical databases are a few examples.
Komissarchik and her father, Edward Komissarchik, say they have figured out how to analyze the forms on Web pages and understand the type of information the sites are looking for.
Then, Glenbrook’s Web crawlers use artificial intelligence to walk themselves through sometimes complex Web forms, answering questions, such as the location of their desired job, in the same way a human would.
Julia Komissarchik likens the process to cracking a safe.
“The way to think of it is, you case the joint,” she said. “The scout goes through the form and tries a few options to see what the results will be. Then you have a mastermind or safecracker who gets all this information from the scout and devises a method to open the forms.”
Finally, she said, the “harvesters” spring into action to gather up all the information.
“As soon as you know the combination, then you can open all the micro-safes, if you will, that are sitting there,” said Edward Komissarchik.
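As a rough illustration of the scout, safecracker and harvester idea, and not Glenbrook's actual method, the sketch below reads a small search form, lists the choices each drop-down menu offers, and works out every query a harvester would then need to request. The class `FormScout`, the function `plan_queries`, the sample form and the `example.com` address are all hypothetical, and the sketch assumes a simple form that is submitted with GET and uses only drop-down menus.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class FormScout(HTMLParser):
    """Hypothetical 'scout': records the form action and every <select> option."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}      # field name -> list of option values
        self._current = None  # name of the <select> currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action", "")
        elif tag == "select" and "name" in attrs:
            self._current = attrs["name"]
            self.fields[self._current] = []
        elif tag == "option" and self._current and attrs.get("value"):
            # Options with an empty value (such as "Any") are skipped.
            self.fields[self._current].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._current = None


def plan_queries(base_url, form_html):
    """Hypothetical 'safecracker': one GET URL per combination of options."""
    scout = FormScout()
    scout.feed(form_html)
    names = sorted(scout.fields)
    target = urljoin(base_url, scout.action)
    combos = product(*(scout.fields[n] for n in names))
    return [target + "?" + urlencode(dict(zip(names, c))) for c in combos]


# Made-up job search form, used only for this illustration.
SAMPLE_FORM = """
<form action="/jobs/search" method="get">
  <select name="city">
    <option value="">Any</option>
    <option value="san-jose">San Jose</option>
    <option value="san-mateo">San Mateo</option>
  </select>
  <select name="field">
    <option value="engineering">Engineering</option>
    <option value="sales">Sales</option>
  </select>
</form>
"""

# A 'harvester' would fetch each of these URLs; here they are only printed.
for url in plan_queries("http://example.com/", SAMPLE_FORM):
    print(url)
```

With two cities and two job fields, the sketch plans four queries. Real forms with free-text boxes, logins or scripts would need considerably more work than this.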
The Komissarchiks immigrated from Russia in 1990. Julia was a math major and computer science minor at the University of California, Berkeley. Edward graduated from Moscow University and was a math professor and researcher who studied databases.
Glenbrook’s technology is not entirely new, says Gary Price, a Maryland research librarian who co-authored a book on the deep Web called “The Invisible Web.”
“The whole idea of having technology fill out a form and pull results has been around for years,” Price said. But he added, “I think this company could be able to do something with the idea of marketing all this data they’ve collected.”
Glenbrook is far from alone in trying to access the deep Web. Yahoo announced partnerships with National Public Radio, the Library of Congress, the New York Public Library and others to index the content in their databases. And Google has added WorldCat, a comprehensive bibliographic database previously accessible only through libraries, to its search results.
Edward Komissarchik said one business opportunity for the company might be to collect and sell hard-to-get data to information brokers such as Dun & Bradstreet. Another possibility is to launch its own specialized Internet search site, focused on an area such as local business directories or jobs.
Edward Komissarchik said that Glenbrook could extract the many job listings that are stashed away in databases on corporate Web sites.
“We can go directly to the owner of the information, which is the employer,” he said.
The number of job-related Web sites has mushroomed in recent years. Several smaller sites have emerged — including SimplyHired and Indeed — whose aim is to scoop up all the job listings on various Web sites.
“It was really a great showcase for us,” Clavier said of the Glendor jobs Web site. “But it’s not where we see the biggest opportunity for the company. We don’t plan to launch a competitor to SimplyHired or Indeed or WorkZoo or those guys.”
The Komissarchiks are especially intrigued by the possibility of collecting detailed information about local businesses — such as business hours and product information — and making it more readily available.
“The deep Web is the future,” Edward Komissarchik said.