New endeavors aim to build a better Internet

Dec. 10, 2007, 2:25 PM UTC

By Bryn Nelson

How do computers know what users really want when they ask for a concise hotel review, a way to kill bacteria or a picture labeled “breakfast” in Arabic? To the annoyance of Internet surfers adrift on oceans of online information, computers often don’t have a clue, even with the compendium of collective wisdom often referred to as Web 2.0.

University of Washington computer scientist and search engine pioneer Oren Etzioni is hoping to make today’s “dumb” computers far more consumer-friendly. As part of a larger push in the field, his latest projects are providing a sneak preview of how online applications might look in a more intuitive Web 3.0 of the not-so-distant future.

“I think that right now, there is the expectation that people will do a lot of the work,” Etzioni said. “The Web is cool, but to get something done like set up a vacation in Italy or even decide when’s the right time to buy your airline ticket to get the right price, it actually demands quite a bit of manual labor.”

Shifting the work to the machine
Web 2.0, Etzioni said, is about sharing information through “the wisdom of the crowds” and distributing labor across a large workforce, as sites like Wikipedia and Flickr have done. World Wide Web inventor Tim Berners-Lee has envisioned the “Semantic Web” as a key part of the next phase, in which all the stored knowledge can be converted to data that is easily retrieved, processed and integrated into a wide range of new applications. “Web 3.0 is trying to push more and more of that labor to the machine,” Etzioni said, “so that the machine can do the work for you.”

Among the crop of new endeavors aiming toward that goal is the University of Washington’s KnowItAll, led by Etzioni. “The idea is to take every sentence on the Web and try to extract the basic facts from it,” he said. Subtle nuances aren’t likely to be captured by the project’s strategy of natural language processing, as it’s known, but clear assertions such as “Thomas Edison invented the phonograph” or “My hotel room was very nice” could be stored in a massive database that merges available information.
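To give a rough sense of what extracting "basic facts" from a sentence can mean, here is a minimal sketch in Python. It is not KnowItAll's actual method, which the article does not detail. The single hand-written pattern, the `extract_fact` helper and the sample sentences are all assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for one kind of clear assertion:
# "<subject> invented the <object>". A real system would need far more than this.
INVENTED = re.compile(r"^(?P<subject>.+?) invented the (?P<object>.+?)\.?$")

def extract_fact(sentence: str) -> Optional[Tuple[str, str, str]]:
    """Return a (subject, relation, object) triple, or None if the pattern does not match."""
    match = INVENTED.match(sentence.strip())
    if match is None:
        return None
    return (match.group("subject"), "invented", match.group("object"))

sentences = [
    "Thomas Edison invented the phonograph.",
    "My hotel room was very nice",
]

# Keep only the sentences that yielded a fact; the rest are skipped.
facts = [fact for fact in map(extract_fact, sentences) if fact is not None]
print(facts)
```

The second sentence is skipped because the toy pattern only recognizes invention claims. Covering opinions such as hotel reviews would require additional patterns or a learned model.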

One application, called Opine, tries to capture the essence of potentially useful but often lengthy and redundant online reviews. “There are more and more reviews of more and more products online, and I find that I spend more and more time shopping, not less,” Etzioni said. “I get more information, but it takes forever.”

Opine is designed to save time by scanning reviews and extracting their key attributes, such as a hotel’s location and the staff’s friendliness. Like an automated version of the popular Zagat restaurant guide, the program can condense opinions into a summary, while allowing users to drill down for more information through hyperlinks. So far, the application is only a research prototype, though Etzioni hopes his group’s demonstration of its capabilities will encourage more development.

The still-in-development application similarly permits queries on the topics of nutrition, general knowledge and the history of science by mining assertions contained in more than 100 million online pages. Even so, Etzioni says the prototype’s knowledge base covers just 1 percent of the current Web. “So you can imagine that if it’s run over the entire Web, which is what we want to do in the future, then it really can provide you with a wealth of information,” he said.

Etzioni brought his data-mining approach to the travel industry last year with the launch of Farecast. The start-up company uses data feeds and proprietary algorithms to comb the Web for the cheapest airfares and then predict whether a traveler should snap up a seat or wait for a better bargain.

“So it both gathers a huge amount of data and can offer analysis of the past and predictions for the future,” Etzioni said. “It will tell you: Is this a good time to buy? Are prices likely to go up or likely to go down?” Earlier this year, the site began predicting hotel prices as well.

The Web of the future
Exactly how the Web of the future will process and distribute information has yet to be fully resolved. Some efforts, like San Francisco-based Metaweb Technologies’ Freebase, are converting Wikipedia entries and other existing databases to machine-readable formats. To date, Freebase has stored information on about 30,000 movies and 580,000 famous people, according to Robert Cook, Metaweb’s co-founder and executive vice president of products.

The database “tends to be of higher quality” than that compiled through natural language processing, Cook said, though he conceded that it covers a narrower knowledge set. The information can also skew toward the interests of the avid community helping to curate some of Freebase’s newest entries, including a summary of every “X-Files” episode, the causes of death for various celebrities, statistical links between fantasy football and real NFL teams, and an annotated map of all human genes with links to relevant research articles.

At IBM’s Almaden Research Center in San Jose, Calif., a project known as Avatar Semantic Search has taken an approach more akin to Etzioni’s, though with a focus on intranet systems that host companies’ e-mail and messaging systems. As an example of the project’s utility, Shiv Vaithyanathan, the center’s manager of Unstructured Information Mining, recalled his frustration in locating the phone number of a student who had included the digits in just one or two e-mails out of several hundred sent. “I was trying to guess when it was that he sent me the e-mail that contained his number,” Vaithyanathan said.

The IBM system aims to take guesswork out of similar queries. “Ideally, a semantic search needs to do several things: identify the sequence as a phone number and realize from the way the sentence is written that the phone number belongs to the person who sent the e-mail,” he said. “Once we know what is it that you want, then packaging it up and delivering it to you is not that hard of a job.”
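The two steps Vaithyanathan describes, recognizing a phone number and tying it to the sender, can be illustrated with a small Python sketch. This is not IBM's implementation. It assumes North American number formats, a made-up list of messages, and the simplification that any number in a message belongs to whoever sent it, where the real system would reason about how the sentence is written.

```python
import re
from typing import Dict, List, Optional

# Assumed format: numbers written like 206-555-0123 or (206) 555-0123.
PHONE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-.]\d{4}\b")

def find_sender_phone(emails: List[Dict[str, str]], sender: str) -> Optional[str]:
    """Return the first phone-like sequence found in any message from `sender`."""
    for email in emails:
        if email["from"] != sender:
            continue  # ignore messages from other people
        match = PHONE.search(email["body"])
        if match:
            return match.group(0)
    return None

# Hypothetical mailbox used only for this example.
emails = [
    {"from": "student@example.edu", "body": "Here is the draft you asked for."},
    {"from": "colleague@example.com", "body": "Call the front desk at 408-555-0199."},
    {"from": "student@example.edu", "body": "You can reach me at (206) 555-0123 after noon."},
]

print(find_sender_phone(emails, "student@example.edu"))
```

Filtering by sender first is what keeps the colleague's front-desk number out of the result, even though it appears earlier in the mailbox.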

Raising the bar
In September, Etzioni’s group revealed another peek at what search engines might deliver in the future with the debut of a research prototype called PanImages. Unlike typical image search engines, PanImages translates a search word across hundreds of languages, sends a query for files tagged with the appropriate word to both Flickr and Google Images, and then displays results from the two sites on a split screen.

“It’s an order of magnitude more languages than people have supported in translation systems before,” Etzioni said. “And where it’s really a boon is for people who speak less popular languages.” Slovenian or Hungarian speakers might be constrained if searching for flower images tagged only in their native tongue (“cvet” and “virag,” respectively). But running the same search with PanImages can retrieve blooms from around the world.
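The idea of widening a search through translation can be sketched in a few lines of Python. The tiny dictionary and the `expand_query` helper below are hypothetical and stand in for the much larger translation resources PanImages draws on. The sketch only builds the list of tags to search for and does not contact Flickr or Google Images.

```python
from typing import Dict, List

# Hypothetical toy dictionary: one concept mapped to its word in three languages.
TRANSLATIONS: Dict[str, Dict[str, str]] = {
    "flower": {"en": "flower", "sl": "cvet", "hu": "virag"},
}

def expand_query(word: str) -> List[str]:
    """Return every known translation of `word`, sorted, or just the word if it is unknown."""
    for forms in TRANSLATIONS.values():
        if word in forms.values():
            return sorted(set(forms.values()))
    return [word]

# A Slovenian speaker's query now also covers images tagged in English and Hungarian.
print(expand_query("cvet"))
```

A word missing from the dictionary falls back to a single-term search, so the expansion never returns fewer tags than the user typed.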

Word translations are based on automated readings of hundreds of dictionaries and wiktionaries on the Internet with some sophisticated reasoning added in, “so the whole is greater than the sum of the parts,” he said. Like Wikipedia, the site hasn’t been immune from mischievous intent. But its availability to the public has yielded translations for the site’s main interface in nearly 50 languages. Word by word, the translation database is expanding and evolving in what Etzioni describes as “Web 2.0 meets Web 3.0.”

Etzioni says that as Internet applications mature, they are raising the bar on the quality and transparency of available information. Beyond that is the promise of helping people keep better track of RSS feeds, blogs, social networking sites and other chunks of data that could easily consume every waking moment. “So my belief is that we need technology – those of us who want to have a chance to take a walk outside and breathe the fresh air once in a while or be with our kids – to help us manage the flow of information,” he said.

For the countless surfers tethered to the Internet, that act of liberation may be one of Web 3.0’s most promising aspects yet.
