NYTimes.com to Categorize Article Pages, Develop Topic Indexes

By: Graham Webster

The New York Times is planning a major addition to its Web site later this year, adding tens of thousands of topic-specific pages that will collect information from the day's news, the paper's archives, and possibly other sources in one place.

"The topic page for us is intended to be a dashboard of information," Rob Larson, director of product management and development for NYTimes.com, told E&P. In addition to the Times' own content, Larson said the paper is negotiating with About.com and other information Web sites to include more information.

Stories on the site will have more links to other NYTimes.com content, including the topic pages, helping readers access the massive stores of information the newspaper produces.

This will all be enabled by robust implementation of software already at work behind the scenes at NYTimes.com.

At the Times, posting articles -- a chore that can be as tedious as cut-and-paste or so seamless you don't even know it's happening -- is a highly automated process that categorizes and subcategorizes articles by content and topic.

After stories are exported from the paper's copy system, a two-stage automatic categorization process takes place, powered by software from Teragram Corporation.

In one round the system picks out "entities" such as publicly traded companies, people, or locations, and notes their inclusion, sometimes adding links to other information. At this point the system might notice that the story mentions, for example, Arnold Schwarzenegger. Then, in a more complicated process, the software uses linguistic algorithms to guess at the actual topic of the story.

For instance, if the story is about Schwarzenegger, is it about him as an actor, a body builder, or a governor? Or maybe it is an unrelated story about Jamie Lee Curtis, mentioning that she co-starred with Schwarzenegger in "True Lies." The software can even detect that a story is about an abstract topic, say "terrorism," even if the word itself never appears in the text, Yves Schabes, Teragram's president and co-founder, told E&P.
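The two rounds described above can be illustrated with a toy sketch. This is not Teragram's software or the Times' configuration: the dictionaries, cue words, weights, and function names below are all hypothetical, and a simple weighted word count stands in for the "linguistic algorithms" the article mentions. It only shows the general shape of the idea, with dictionary lookup for entities followed by topic scoring that does not depend on the topic's name appearing in the text.

```python
import re
from collections import Counter

# Hypothetical entity dictionaries; the article mentions people,
# companies, and locations, but these entries are made up for illustration.
ENTITY_DICTIONARIES = {
    "person": ["Arnold Schwarzenegger", "Jamie Lee Curtis"],
    "location": ["California"],
}

# Hypothetical cue words and weights per topic. A production system would
# rely on much richer linguistic analysis than a weighted word count.
TOPIC_CUES = {
    "politics": {"governor": 2, "election": 2, "budget": 1, "legislature": 2},
    "movies": {"film": 2, "co-starred": 2, "actor": 1, "sequel": 1},
    "terrorism": {"bombing": 2, "hijackers": 2, "militants": 2, "attack": 1},
}


def find_entities(text):
    """Round 1: return (type, name) pairs for dictionary entries found in text."""
    found = []
    for entity_type, names in ENTITY_DICTIONARIES.items():
        for name in names:
            if re.search(r"\b" + re.escape(name) + r"\b", text):
                found.append((entity_type, name))
    return found


def guess_topics(text, top_n=1):
    """Round 2: rank topics by the summed weights of cue words in the text."""
    words = Counter(re.findall(r"[a-z][a-z\-]*", text.lower()))
    scores = {
        topic: sum(weight * words[cue] for cue, weight in cues.items())
        for topic, cues in TOPIC_CUES.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Only topics with at least one matching cue are suggested.
    return [topic for topic, score in ranked[:top_n] if score > 0]


if __name__ == "__main__":
    story = ("Jamie Lee Curtis, who co-starred with Arnold Schwarzenegger "
             "in the film, will appear in a sequel.")
    print(find_entities(story))
    print(guess_topics(story))
```

In this sketch the Curtis story is tagged with both names but filed under "movies" rather than "politics", and a story about militants and a bombing would score highest for "terrorism" even though that word never appears. The suggested terms would then go to an editor for approval or adjustment.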

The editor need only approve or slightly adjust the computer's work to properly flag and categorize a story within the Times system.

"The software does the lion's share of the real work," Larson said. "The editor can pretty quickly choose the terms."

At the Times, this means the system will file the story in one or more of about 800 categories and note the mention of what Larson called "fairly large dictionaries" of organizations, names, and locations. Larson said the Times also recently added dictionaries of movie titles, and car makes and models. They're also considering adding health conditions, he said.

The Web site has already been using categorization for years to issue topic-based e-mail alerts to people who subscribe to Times News Tracker, which went from free to paid in May 2003. "It was a year that it was fully a free service, and we grew the audience to over 450,000 readers who had at least one active alert being sent to them," said Eliot Pierce, a product manager for NYTimes.com.

Now, at $29.95 a year, about 20,000 paid subscribers get 20 customized e-mail alerts, access to articles in their News Tracker categories up to 90 days old, and breaking news alerts. "These users are some of the most loyal, and perhaps our most active segment of our users on the site," Pierce said. Online advertisers can also buy the privilege of reaching those very active users, he added.

Categorization is also active in the Web site's travel section, where location-based guide pages are generated from Times archives. After introducing hotlinking to company information pages from stories in the business section, the Times also launched a college-related site in 2001.

"We began talking with [Teragram] about doing a pilot program for a special section of the Web site that was dedicated to college students and professors," Larson said. That site, at www.nytimes.com/college, offers two years of archive access sorted by academic discipline. Those pages are precursors to the larger topic page initiative now in the works.

Building the topic pages presents some challenges. Unlike search-result pages, which are generated live at the moment of a user's search, topic pages will be more persistent. This means that search engines will be able to index the pages, making every topic page a port of entry to the site. If someone Googles "George Pataki," for instance, the Times topic page might gradually work its way toward the top of those results. To take advantage of that opportunity, Larson said, the site will be engineered for easy indexing.
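One common way to make pages persistent and easy to index is to give each topic a stable, readable address derived from its name, rather than a query string built at search time. The sketch below assumes that approach; the article does not say how the Times will structure its addresses, and the "/topics" prefix and both function names are hypothetical.

```python
import re
import unicodedata


def topic_slug(name):
    """Turn a topic name into a lowercase, hyphen-separated slug."""
    # Drop accents so the slug stays plain ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def topic_path(name, base="/topics"):
    """Build a stable path for a topic page (the base path is an assumption)."""
    return f"{base}/{topic_slug(name)}"


if __name__ == "__main__":
    print(topic_path("George Pataki"))
```

Because the same topic name always maps to the same path, a crawler that revisits the address finds the same page with updated content, which is what allows it to accumulate ranking over time.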

Right now, Pierce said, the Times is determining what role RSS feeds might play in the new pages. In the future, the same Teragram technology could be re-implemented to create user-personalized pages drawing from a custom mix of topics and "entities" of interest, but if the Times is planning such a thing, no one there is talking.
