By: Steve Outing
Unless you’re a research professional, you might think that the most popular search engines can find most anything that’s on the Web, short of content behind subscription walls and on intranets. You couldn’t be more wrong.
Actually, if you rely on a search engine like Google, Altavista, Hotbot, or Lycos, you’re only gaining access to a fraction of what’s available online. The Web is a fantastic reporting and research tool for journalists, but there’s much more to it than meets the untrained eye.
Authors and search gurus Chris Sherman and Gary Price estimate that the amount of content on the Web that’s freely accessible represents 2 to 50 times what’s accessible via the major search engines. They call this the “invisible Web,” which they detail in their new book, “The Invisible Web: Uncovering Information Sources Search Engines Can’t See” (CyberAge Books, 2001). This is an important text in helping journalists understand how to get more out of the Internet as a research tool.
It’s difficult to estimate just how big this invisible Web is. The co-authors hedge their bets with the 2- to 50-times estimate. (Previous estimates of the invisible Web’s size have put it as high as 300 times as big.) But Sherman points out that the “visible” Web is estimated to be 2-4 billion pages. The most comprehensive search engine, he says, is Google — which has catalogued about 1.6 billion pages. So even if you only want to search the visible Web, you still need to use multiple search engines to have access to all 2-4 billion pages. (That’s lesson No. 1: If you don’t find what you seek with one search engine, try the same search on others. Or use metasearch engines like Dogpile or MetaCrawler, which search across multiple search engines.)
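To see why a single search engine is not enough, it helps to run the numbers quoted above. The short Python sketch below uses only the figures from this column (1.6 billion pages indexed by Google, 2 to 4 billion pages on the visible Web); the function and variable names are made up for illustration.

```python
# Figures quoted in the column (2001 estimates).
GOOGLE_INDEXED = 1.6e9          # pages Google had catalogued
VISIBLE_WEB_RANGE = (2e9, 4e9)  # estimated size of the visible Web, low to high

def coverage_range(indexed, web_range):
    """Return the (lowest, highest) share of the visible Web one index covers."""
    low_size, high_size = web_range
    # The bigger the Web really is, the smaller the share a fixed index covers.
    return indexed / high_size, indexed / low_size

low, high = coverage_range(GOOGLE_INDEXED, VISIBLE_WEB_RANGE)
print(f"Largest single index covers roughly {low:.0%} to {high:.0%} of the visible Web")
```

By these estimates, even the most comprehensive engine misses somewhere between a fifth and well over half of the visible Web, before the invisible Web enters the picture.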
Why is it invisible?
An obvious question: If all this content is published on the Web and is meant to be freely accessible, why don’t the search engines include it? As Sherman points out, it’s not really “invisible,” it’s just not visible via most search engines because they have chosen not to include it within their services.
Much of the invisible content is in various formats that the search engines’ robots (programs that “crawl” the Web looking for and indexing pages) aren’t programmed to seek out. For example, most search engines don’t catalog audio and video files; most don’t go into database Web sites and extract information; most don’t catalog PDF files (though Google now does); and so on. Sites that are database driven and whose pages are all dynamically generated (as opposed to having a single, static URL for each piece of content) are often ignored by the search engines.
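As a rough illustration of how a crawler might end up ignoring database-driven pages, here is a minimal Python sketch. The rule it applies (skip any URL with a query string or a script-like file extension) is an assumption made for this example, not a description of how any particular search engine works, and the example.com addresses are invented.

```python
from urllib.parse import urlparse

# Assumed heuristic, for illustration only: treat a URL as "dynamic" if it
# carries a query string or points at a script-like file extension.
SCRIPT_EXTENSIONS = (".cgi", ".asp", ".php", ".jsp", ".cfm")

def looks_dynamic(url: str) -> bool:
    parts = urlparse(url)
    if parts.query:
        return True
    return parts.path.lower().endswith(SCRIPT_EXTENSIONS)

def crawlable(urls):
    """Keep only the URLs a cautious crawler of this kind would follow."""
    return [u for u in urls if not looks_dynamic(u)]

# Invented sample addresses: one static page, two database-style pages.
sample = [
    "http://www.example.com/reports/2001/summary.html",
    "http://www.example.com/data/query.cgi",
    "http://www.example.com/lookup?country=BR&year=2000",
]
for url in sample:
    print("skip " if looks_dynamic(url) else "index", url)
```

Under a rule like this, only the static page is indexed. The two database pages are just as public, but they never make it into the search results.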
The search engines have the ability, in theory, to index just about everything on the Web. Sherman says that if the search companies were willing to spend the money, they could index most everything on the invisible Web within a month. Limiting what they index and make available on their free search sites is simply an economic decision. (It’s worth noting that some of the enterprise search products from the search engine companies do catalog content formats that their free public search sites do not. For instance, Altavista’s enterprise offering can catalog 225 file formats; the Altavista.com site indexes only a half dozen.)
The key point is this: Anything that can be displayed in a browser window can be included in a search engine, says Sherman.
How to use the invisible Web
Reporters who use the Web as a tool should learn how to use the invisible Web. It is more work than searching the visible Web, say Sherman and Price, but the payoff can be huge.
The simplest technique for using the invisible Web is to search on Web sites themselves, instead of relying on the search engines. Some content that’s on a site may be in a format that the major search engines don’t index, so a search via Google, for example, won’t turn up all the content that’s on the site.
WorldBank.com is a good example of this. The site is freely accessible on the Web, and much of its content is on static Web pages, which show up in searches on the major search engines. However, some of its most valuable content is stored on the site in searchable database formats; this content is “invisible” to search engines and must be ferreted out by browsing or searching on the site itself. Even the Library of Congress Web site has visible as well as invisible content.
While it’s not likely that you’ll use Google to search for something on the Library of Congress site, that is a common strategy for seeking information on smaller sites. A Google search for information about a specific company, for example, might turn up pages from its corporate site. The trick is to recognize that there may be more information and data stored on that site, freely accessible but invisible to the search engine, and that finding it means spending the time to poke around the site itself.
Results from the general search engines sometimes turn up gems, which you can then go investigate further. A search for “peanut farming,” for instance, might turn up a reference to a peanut farming database — but no information from within the database. Sherman says one simple and useful trick in an instance like this is to search for “peanut farming database.” That will at least alert you to the existence of what can be a useful source of information.
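Sherman’s trick is easy to apply by hand, and just as easy to script if you run the same kind of search often. The Python sketch below appends the word "database" to a topic and builds a search link from it. The endpoint address and the `q` parameter name are placeholders assumed for the example; substitute whatever your preferred search engine actually uses.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint; substitute the engine you actually use.
SEARCH_ENDPOINT = "https://search.example/search"

def database_query(topic: str) -> str:
    """Append the word 'database' to a topic, following the tip above."""
    return f"{topic.strip()} database"

def search_url(query: str) -> str:
    """Build a link for the query; 'q' is an assumed parameter name."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': query})}"

query = database_query("peanut farming")
print(query)
print(search_url(query))
```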
There are lots of database-driven sites that are invisible to most search engines. Biography.com is an example, with thousands of biographies tucked away in a database structure that some search engines can’t see (or more precisely, don’t choose to see).
The Web site of the U.S. Centers for Disease Control contains much information about anthrax, but it will take searching the site itself to mine the information stores effectively.
Audio and video searches
As broadband Internet access continues to proliferate, more and more content on the Web is in audio and video formats — which most search engines can’t deal with. Price says an important trend is the emergence of services that convert audio into text and then provide searching from the transcripts.
For example, the Web site of the Newshour With Jim Lehrer has a search feature that can perform keyword searches on archives from the PBS television news program. Financial news provider Bloomberg has a new Video Player feature for its broadcasts which can do keyword searches within the broadcast content. And best of all, Compaq has an experimental site called SpeechBot that is a search engine for audio and video content hosted and played from a variety of Web sites.
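The idea behind these services (turn speech into text, then search the text) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transcript snippets are made up, the speech-to-text step is taken as already done, and nothing here reflects how the Newshour, Bloomberg, or SpeechBot actually implement their search.

```python
from collections import defaultdict
import re

def build_index(transcripts):
    """Map each word to the (clip, start_second) pairs where it is spoken."""
    index = defaultdict(list)
    for clip, segments in transcripts.items():
        for start_second, text in segments:
            # set() avoids listing a segment twice if a word repeats in it.
            for word in set(re.findall(r"[a-z']+", text.lower())):
                index[word].append((clip, start_second))
    return index

def search(index, keyword):
    """Return every (clip, start_second) where the keyword occurs, sorted."""
    return sorted(index.get(keyword.lower(), []))

# Made-up transcript snippets, standing in for speech-to-text output.
transcripts = {
    "evening-news": [(0, "Good evening and welcome"),
                     (42, "Officials discussed anthrax testing")],
    "markets-report": [(15, "Stocks fell on anthrax fears")],
}
index = build_index(transcripts)
print(search(index, "anthrax"))
```

Because each transcript segment keeps its starting time, a keyword hit can point the listener to the right moment in the broadcast rather than just to the file.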
Tools for the beat reporter
Sherman says that the invisible Web is probably a better tool for beat journalists than for general-assignment reporters who cover a wide variety of topics. If you have a specialty, it makes sense to spend time finding invisible Web sources on your topic. And this will take time.
Price likens the process of learning what the invisible Web has to offer for your beat to the process that any reporter goes through in learning a new beat. It’s about cultivating sources and learning what resources are available, so you know where to go when the need arises for specific information. And importantly, recognize that the invisible Web is a fast-changing, dynamic environment that will constantly challenge you to keep up with it. “There’s no one set of rules” for searching the invisible Web, he says.
The invisible Web takes work to navigate. But the rewards can be great for a journalist who makes the effort to learn what it can give.
The catch …
So, this all sounds wonderful, eh? Already, the visible Web is a great, time-saving research tool for journalists. There’s a ton of information on the visible Web. Can it get better, or will a new wave of “invisible Web” content (in amounts that dwarf the visible Web) make the information overload situation that much worse? We’re already swimming in an ocean of Web pages and proprietary databases. Do we really need more?
I subscribe to the theory that for journalists, there can never be enough access to information. In this new age of the invisible Web, journalists are well positioned to become experts in tapping its wealth — in a way the general public cannot. As the amount of information online expands, journalists (and news researchers) can be instrumental in sorting through it all.
The invisible Web is just another item in the reporter’s toolkit. Learn to use it.
Get Stop The Presses! by e-mail
If you would like to get e-mail delivery of the Stop The Presses! column, there are two options:
1) Text e-mail. I send out a text e-mail message containing a brief description of the current column, along with a URL link to the actual column on the E&P Web site. To receive these regular reminders, sign up here.
2) HTML e-mail. If you prefer to receive the entire column, you can have it delivered to you as an HTML e-mail message whenever a new column is published. Sign up here.
Got a tip? Let me know about it
If you have a newsworthy item about the online news media business, please send me a note.