Ranking blogs that deliver the biggest bang for the buck may not seem to flow naturally from detecting contamination in a city water system.
With a new mathematical formula, however, Carlos Guestrin and his research team at Carnegie Mellon University in Pittsburgh are setting their sights on improving how scientists monitor everything from algal blooms in lakes to subtle differences in how someone sits or slouches in a chair.
Scientists have long studied how information, influence or physical items move through networks. But by combining that field of research with the problem of detecting such flows optimally and cost-effectively, the Carnegie Mellon researchers have devised a formula, or algorithm, that could lead to dramatically improved sensor networks, whether geared toward political blogs or posture.
Guestrin, an assistant professor of computer science and machine learning, received his initial inspiration through a collaboration with civil engineers on a problem-solving competition sponsored by the Environmental Protection Agency. Given a model of an unnamed city’s network of water pipes, a simulation of water consumption and different contamination scenarios, the teams had to decide where to place a limited number of $5,000 sensors to detect contamination as quickly as possible.
The competition organizers initially declared every team a winner, to Guestrin’s frustration, but a subsequent evaluation awarded his team the top score. Buoyed by his success, he began thinking about how the central idea could be adapted for other applications, including how information spreads across the Internet.
“Somebody places a story and some people link to the story, and some people link to the links, and so on,” he said. “You can think of it as a cascade of information.”
The Cascades algorithm
But how would this cascade be modeled? What sensors, or blogs in this case, should be tapped to maximize the likelihood of capturing a big story early on during its propagation over the blogosphere?
A team of researchers and graduate students from Carnegie Mellon eventually created a complex mathematical equation called the cost-effective lazy forward-selection algorithm, later dubbed the Cascades algorithm for simplicity’s sake.
One part seeks to maximize reward, in this case detecting the most news in the least amount of time. Within the algorithm, that reward concept is captured by tallying the number of people who read a news item after it appears on a specific blog. If 10 million people read a story after its initial posting on Blog A but only 1,000 had read it beforehand, the story would be deemed both newsworthy and early-breaking for Blog A’s readers.
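As a rough illustration of that tally, the sketch below assumes a news cascade is stored as a time-ordered list of (blog, readers) events and counts only readers reached by posts that come strictly after a given blog's first post. The data layout, the blog names and the function name `readers_after` are hypothetical and are not taken from the study.

```python
from typing import List, Tuple

# Hypothetical cascade: time-ordered (blog, readers reached by that post) events.
Cascade = List[Tuple[str, int]]

def readers_after(cascade: Cascade, blog: str) -> int:
    """Readers reached by posts that follow `blog`'s first post (0 if it never posts)."""
    for i, (name, _) in enumerate(cascade):
        if name == blog:
            # Sum only the events that come later in the cascade.
            return sum(readers for _, readers in cascade[i + 1:])
    return 0

# Made-up numbers mirroring the example in the text: about 1,000 readers
# before Blog A posts, 10 million afterward.
story = [
    ("blog_x", 1_000),
    ("blog_a", 50_000),
    ("blog_b", 4_000_000),
    ("blog_c", 6_000_000),
]
early = readers_after(story, "blog_a")  # blog_a caught the story early
late = readers_after(story, "blog_c")   # blog_c was last to post
```

Under these assumptions a blog that posts early in a large cascade earns a high reward, while the last blog to pick up the story earns none.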
A second part of the algorithm seeks to minimize cost, namely the inordinate time that could be spent reading blogs. The team also exploited a mathematical relationship known as the law of diminishing returns.
“If I place five sensors in the water distribution system, the sixth sensor helps me much more than if I have 10,000 sensors and want to place the 10,001st,” Guestrin said. The same holds true for adding more blogs to a news-detection system.
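Diminishing returns is also what makes a "lazy" selection strategy possible: because a blog's added value can only shrink as more blogs are chosen, a value computed in an earlier round remains a valid upper bound and often need not be recomputed. The following is a minimal sketch of that idea, assuming a simple coverage objective (each blog catches a set of stories) as a stand-in for the study's actual reward. The function name `lazy_greedy`, the blog names and the story IDs are invented for illustration.

```python
import heapq
from typing import Dict, List, Set

def lazy_greedy(coverage: Dict[str, Set[int]], k: int) -> List[str]:
    """Pick up to k blogs, greedily maximizing the number of distinct stories covered."""
    covered: Set[int] = set()
    chosen: List[str] = []
    # Max-heap via negated gains; entries are (-gain, blog, round in which gain was computed).
    heap = [(-len(stories), blog, 0) for blog, stories in coverage.items()]
    heapq.heapify(heap)
    while heap and len(chosen) < k:
        neg_gain, blog, stamp = heapq.heappop(heap)
        if stamp == len(chosen):
            # The gain is up to date and beats every other (upper-bound) entry.
            if neg_gain == 0:
                break  # no remaining blog adds anything new
            chosen.append(blog)
            covered |= coverage[blog]
        else:
            # Stale upper bound: recompute the marginal gain and reinsert.
            gain = len(coverage[blog] - covered)
            heapq.heappush(heap, (-gain, blog, len(chosen)))
    return chosen

coverage = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {6}, "d": {1, 2}}
top_two = lazy_greedy(coverage, 2)
# Diminishing returns for blog "b": alone it adds 3 stories, after "a" only 1.
gain_b_alone = len(coverage["b"])
gain_b_after_a = len(coverage["b"] - coverage["a"])
```

In this toy run, blog "c" is never re-evaluated in the second round, which is the kind of saving lazy evaluation aims for on much larger collections.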
For one scenario, the researchers assumed that a reader could peruse no more than 100 blogs while another scenario had a budget set at no more than 5,000 postings. For each, the researchers asked the same question: which blogs should one read to be the most up-to-date on newsworthy stories? The team considered 45,000 blogs in all, obtaining 10 million posts and identifying 350,000 news cascades over the course of 2006.
Although the resulting top 100 blogs contain some familiar names, the two scenarios led to almost completely different lists.
Given a budget of 100 blogs, the biggest bang for the buck belonged to the popular Instapundit blog, which featured more than 4,500 postings throughout the year. Assuming a budget of 5,000 posts, however, the top-scoring blog was the less well-known sisu site, which featured only 331 posts for all of 2006.
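A toy sketch can show why the two budgets favor different blogs. It assumes one simple way of modeling the scenarios, which may differ from the study's implementation: under a blog-count budget every blog costs one unit, while under a post-count budget a blog costs as many units as it has posts, and candidates are ranked by added stories per unit of cost. The blogs, story sets and post counts are made up.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical data: blog -> (stories it catches early, number of posts to read).
Blogs = Dict[str, Tuple[Set[int], int]]

def greedy_by_ratio(blogs: Blogs, budget: int, per_post_cost: bool) -> List[str]:
    """Repeatedly add the affordable blog with the best marginal gain per unit cost."""
    covered: Set[int] = set()
    chosen: List[str] = []
    spent = 0
    while True:
        best, best_ratio = None, 0.0
        for name, (stories, posts) in sorted(blogs.items()):
            cost = posts if per_post_cost else 1  # assumes every blog has >= 1 post
            if name in chosen or spent + cost > budget:
                continue
            ratio = len(stories - covered) / cost
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            return chosen  # nothing affordable adds new stories
        stories, posts = blogs[best]
        covered |= stories
        spent += posts if per_post_cost else 1
        chosen.append(best)

blogs: Blogs = {
    "prolific": (set(range(0, 40)), 4500),   # many stories, many posts
    "selective": (set(range(30, 50)), 300),  # fewer stories, few posts
    "niche": (set(range(50, 58)), 150),
}
by_blog_count = greedy_by_ratio(blogs, budget=1, per_post_cost=False)
by_post_count = greedy_by_ratio(blogs, budget=500, per_post_cost=True)
```

With a budget of one blog, the prolific site wins because it catches the most stories. With a budget of 500 posts it is unaffordable, and the two sparser sites are chosen instead.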
Making news in the blogosphere
The study, named the best student paper after being presented by graduate student Jure Leskovec at last year’s International Conference on Knowledge Discovery and Data Mining, subsequently set off several rounds of editorializing on sites that were either included in or excluded from the lists, effectively generating some news cascades of its own.
“I haven’t yet written a paper about my own paper,” Guestrin said, laughing at the irony.
Among the many comments either praising or second-guessing the study, blogger Bora Zivkovic questioned the usefulness of ranking blogs based on 2006 data (“eons ago in Internet time”) and pointed out that his own highly ranked blog, Science & Politics, had been mostly unattended since June 2006.
But Guestrin said the academic exercise, by design, only looked at one year’s worth of entries. And blogs such as Zivkovic’s, which excelled at posting big stories earlier than other sites but included relatively few items overall, would score higher based on the algorithm’s consideration of inlinks, outlinks and posting frequency than a blog that picked up big stories but interspersed them among many more items.
Another outcome of the study, Guestrin said, is a glimpse into what specific networks of bloggers consider newsworthy, a group decision-making process that helped propel sites devoted to coffee lovers and parents who homeschool their children into the top 100 blogs.
“It all depends on what the Web vernacular of the time is,” he said, speculating that a list based on 2008 postings would likely be more heavily skewed toward political sites.
Less obvious applications
Beyond its usefulness for news aggregator sites and others seeking to tame the Internet’s vast jungle of information, the research has led to less obvious applications.
In one offshoot, Guestrin’s group is collaborating with researchers at the UCLA-based Center for Embedded Networked Sensing to optimize how algal blooms are monitored. For surveying efforts at California’s Lake Fulmor and Lake Merced, among other bodies of water, Guestrin’s team has contributed a Cascades-based algorithm that points to the best route for a sensor-equipped boat to maximize its recording activity, given its dependence on limited battery power.
Guestrin is also working with another group at Carnegie Mellon to help design a chair that can detect subtle differences in the positions of people sitting on it, a tool geared toward senior citizens and disabled patients.
The starting point, Guestrin said, was a sensor that covered every square centimeter of the chair, but cost a cool $10,000. The challenge was to smartly position a much smaller number of sensors that could still accurately capture information identifying whether the sitter was reading, sleeping or hunched over to one side. Guestrin’s prototype sensor is now down to about $80. Eventually, he hopes, it will be used as a benchmark for predicting posture.
Jon Kleinberg, a professor of computer science at Cornell University in Ithaca, N.Y., who has worked with some of the project’s collaborators previously but wasn’t involved with the Cascades effort, said the algorithm has provided “a powerful unifying framework for thinking about things ranging from news and diseases to fads.”
With a limited advertising budget, for example, a marketing firm would need to strike a balance between maximizing visibility and minimizing its overall investment. If the budget allowed for just a few billboards, hanging one in Manhattan’s Times Square would undoubtedly capture plenty of eyeballs.
“But you’d likely not want to place a second too close, or you’d simply be reaching the same crowd with both,” Kleinberg said.
As for identifying the online network that would collectively capture the biggest news early on, he said, the new research has helped to map out the landscape of sources with influence in the blogosphere, at least with like-minded readers.
If a blog is particularly close to a breaking story, though, is it good at noticing the news or adept at amplifying items into news?
That question and the larger issue of how online sites not only propagate but also shape information as it travels, Kleinberg said, are fertile grounds for further research — and undoubtedly, for more blogging.