Tom Ryan wanted to build something that could identify criminal behavior inside massive mobile networks, stock trading services, ecommerce sites, and other online operations. So he turned to a pair of familiar names for help: Facebook and the NSA.
He didn’t exactly knock on Facebook’s front door—let alone the NSA’s. But he did adopt a pair of sweeping software systems built by these giants of the online age, systems that help them juggle the massive amounts of digital information streaming into their computer data centers.
Ryan grabbed an NSA tool called Accumulo, which likely plays a key role in the agency’s notoriously widespread efforts to monitor internet traffic in the name of national security, and he paired it with a Facebook tool called Presto, used to quickly analyze the way people, ads, and all sorts of other things behave on the world’s largest social network. Both Facebook and the NSA, you see, have open sourced their software, meaning these tools are freely available to the world at large.
Ryan is the CEO of a small Silicon Valley startup called Argyle Data. Over the past sixteen months, he and his engineering team used Accumulo and Presto to fashion software that can root out fraud inside today’s massive online operations, and they’ve already deployed the thing with at least a few companies, including Vodafone, the British telecommunications giant that runs mobile phone networks across Europe.
Argyle is a nicely rounded metaphor for the recent evolution of the data-juggling technologies that drive our modern businesses. Over the past several years, massive web companies such as Google and Facebook—as well as similarly ambitious operations like the NSA—have built a new breed of software that can store and analyze data across tens, hundreds, and even thousands of machines, and now, these software tools are trickling down to the rest of the business world. “As a startup,” Ryan says, “you want to build on what’s new, not what’s old.”
The poster child for this movement is a software system called Hadoop, which was inspired by work originally done at Google. But Hadoop—at least as it was originally conceived—is now giving way to tools that operate at much faster speeds. Hadoop is a “batch” system, meaning you assign it a task and then wait a good while for the answer to come back. Newer systems are much better at operating at speed.
Argyle’s software is a prime example. Using machine learning and what’s called deep packet inspection, it analyzes the individual packets of data that stream across a network, and if a piece of data meets certain criteria—i.e. sets off certain flags—it gets shuttled into Accumulo, a massive database that can extend across myriad machines. “It helps us scan tens of millions to hundreds of millions of transactions a second,” Ryan says. Companies can then use a version of Presto to further analyze this data, executing specific queries in near real-time.
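The flag-then-store step described above can be pictured with a short sketch. This is not Argyle's code and does not use the Accumulo or Presto APIs. The `Packet` fields, the two rules, and the `flag_packets` helper are all hypothetical, chosen only to show the idea of inspecting every record but keeping only those that trip at least one rule.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Packet:
    """A toy stand-in for an inspected network packet (hypothetical fields)."""
    src: str
    dst: str
    payload_bytes: int


# A rule is any function that returns True when a packet looks suspicious.
Rule = Callable[[Packet], bool]

# Hypothetical criteria; a real system would use learned models as well.
RULES: List[Rule] = [
    lambda p: p.payload_bytes > 10_000,        # unusually large payload
    lambda p: p.dst.startswith("203.0.113."),  # destination on a watch list
]


def flag_packets(packets: Iterable[Packet], rules: List[Rule]) -> List[Packet]:
    """Return only the packets that set off at least one rule."""
    return [p for p in packets if any(rule(p) for rule in rules)]


stream = [
    Packet("10.0.0.1", "198.51.100.7", 512),
    Packet("10.0.0.2", "203.0.113.9", 128),
    Packet("10.0.0.3", "198.51.100.8", 50_000),
]

# In the article's description, flagged records would go into a distributed
# database; here a plain list stands in for that store.
flagged_store = flag_packets(stream, RULES)
```

Filtering before storage is what keeps the later queries manageable: only the flagged records need to be held and searched.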
Christopher Nguyen, the CEO of a data analysis startup called Adatao who once worked with similar “big data” software inside Google, says that Argyle’s method isn’t necessarily the best way to analyze such massive amounts of information at speed. But he agrees that it is part of a much larger movement toward “real-time” big data tools, a group that also includes Spark, developed at the University of California at Berkeley, and various other software contraptions.
At the same time, Argyle’s story underlines another aspect of this movement. At the NSA, you see, Accumulo is likely part of a surveillance effort that undermines our online privacy, and as tools like this make it easier to collect and analyze such enormous amounts of data, they may help push us towards a world where privacy is eroded even further. Vodafone, after all, is using Argyle’s software to closely analyze data streaming across European wireless networks used by the general public.
According to Seth Schoen, a staff technologist with the Electronic Frontier Foundation, laws typically allow companies to use tools along the lines of Argyle’s, including deep packet inspection, to do things like fight fraud. But in the end, their effect on privacy boils down to the policy of each individual company. The good news with Argyle, as Ryan points out, is that the NSA built Accumulo so that organizations can closely control who, within their operation, has access to each individual piece of data. “It’s a trade-off,” Ryan says. “Privacy is so important. But with more data-enrichment, you can improve the results of your analytics.”
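To make the idea of per-item access control concrete, here is a minimal sketch. It is a simplification and not the Accumulo API: Accumulo attaches a visibility expression to each cell and checks it against a user's authorizations, while this toy version assumes each cell simply lists labels that must all be held. The `Cell` class, the `scan` function, and the label names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Cell:
    """One stored value plus the labels a reader must hold to see it."""
    row: str
    column: str
    value: str
    required_labels: FrozenSet[str] = frozenset()


def scan(cells: Iterable[Cell], authorizations: Iterable[str]) -> List[Cell]:
    """Return only the cells whose required labels are all held by the reader."""
    auths = frozenset(authorizations)
    return [c for c in cells if c.required_labels <= auths]


table = [
    Cell("call-001", "duration", "42s"),  # no labels: visible to everyone
    Cell("call-001", "subscriber", "alice", frozenset({"fraud_team"})),
    Cell("call-001", "location", "cell-77", frozenset({"fraud_team", "legal"})),
]

analyst_view = scan(table, {"fraud_team"})          # sees duration and subscriber
counsel_view = scan(table, {"fraud_team", "legal"})  # sees all three cells
```

Because the check runs on every cell rather than on the whole table, two people querying the same record can get different answers, which is the kind of control Ryan describes.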
--- Cade Metz, Wired