While the collection is impressive for its sheer volume, the data doesn't include sensitive information like passwords, credit card numbers, or Social Security numbers. It does, though, contain profiles of hundreds of millions of people that include home and cell phone numbers, associated social media profiles like Facebook, Twitter, LinkedIn, and GitHub, work histories seemingly scraped from LinkedIn, almost 50 million unique phone numbers, and 622 million unique email addresses.
"It’s bad that someone had this whole thing wide open," Troia says. "This is the first time I've seen all these social media profiles collected and merged with user profile information into a single database on this scale. From the perspective of an attacker, if the goal is to impersonate people or hijack their accounts, you have names, phone numbers, and associated account URLs. That's a lot of information in one place to get you started."
Troia found the server while looking for exposures with fellow security researcher Bob Diachenko on the web scanning services BinaryEdge and Shodan. The IP address for the server simply traced to Google Cloud Services, so Troia doesn't know who amassed the data stored there. He also has no way of knowing if anyone else found and downloaded the data before he did, but notes that the server was easy to find and access. WIRED checked six people's personal email addresses against the data set; four were there and returned accurate profiles. Troia reported the exposure to contacts at the Federal Bureau of Investigation. Within a few hours, he says, someone pulled the server and the exposed data offline. The FBI declined to comment for this story.
Of Unknown Origin
The data Troia discovered seems to be four data sets cobbled together. Three were labeled, perhaps by the server owner, as coming from a data broker based in San Francisco called People Data Labs. PDL claims on its website to have data on over 1.5 billion people for sale, including almost 260 million in the US. It also touts more than a billion personal email addresses, more than 420 million LinkedIn URLs, more than a billion Facebook URLs and IDs, and more than 400 million phone numbers, including more than 200 million valid US cellphone numbers.
PDL cofounder Sean Thorne says that his company doesn't own the server that hosted the exposed data, an assessment Troia agrees with based on his limited visibility. It's also unclear how the records got there in the first place.
"The owner of this server likely used one of our enrichment products, along with a number of other data-enrichment or licensing services," Thorne says. "Once a customer receives data from us, or any other data providers, the data is on their servers and the security is their responsibility. We perform free security audits, consultations, and workshops with the majority of our customers."
Troia thinks it's unlikely that People Data Labs was breached, since it would be simpler to just buy data from the company. An attacker on a budget could also sign up for a free trial that PDL advertises, offering 1,000 consumer profiles per month. "One thousand profiles to 1,000 burner accounts and you've got pretty much all of it," Troia points out.
One of the other data sets is labeled "OXY," and every record in it also contains an "OXY" tag. Troia speculates that this may refer to Wyoming-based data broker Oxydata, which claims to have 4 TB of data, including 380 million profiles on consumers and employees in 85 industries and 195 countries around the world. Martynas Simanauskas, Oxydata director of business-to-business sales, emphasized that Oxydata hasn't suffered a breach and that it does not label its data with an "OXY" tag.
"While the part of the database Vinny found presumably might be acquired from us or one of our customers, it has definitely not been leaked from our database," Simanauskas told WIRED. "We sign the agreements with all our clients that strictly forbids the data reselling and obliges them to ensure that all of the appropriate security measures are taken. However, there is no way for us to enforce all of our clients to follow the best data protection practices and guidelines. Judging from the data structure, it seems clear that the database found by Vinny is a work product of a third party, with entries generated from multiple different sources."
The fact that neither data broker could rule out the possibility that one of their customers mishandled their data speaks to the larger security and privacy issues inherent in the business of buying and selling data.
"What stands out about this incident is the sheer volume of data that’s been collected and how it’s been aggregated, stored, and commercialized without the knowledge of the data owners. My own personal information is in there," says security researcher Troy Hunt, who runs the comprehensive data exposure tracking service HaveIBeenPwned. "We’re definitely seeing more data than ever circulating. It’s not just due to more data breaches, it’s also due to the propagation of data that’s already been breached. We’re seeing that data then taken by other services, duplicated, then breached again."
As with some of his past disclosures, Troia provided information from the trove to Hunt for HaveIBeenPwned. In all, Hunt added more than 622 million unique email addresses and other data to his repository, and is currently notifying the HaveIBeenPwned network.
Neverending Leaks
This data exposure is just the latest in a seemingly endless string of large-scale discoveries. At the beginning of this year, 2.2 billion records were found distributed on hacker forums across several tranches known as Collections #1-5. In March, Troia and Diachenko discovered that a single email marketing firm called Verifications.io had left 809 million records publicly accessible. In 2018 the marketing firm Exactis leaked a database of 340 million personal records, and a breach of the sales intelligence firm Apollo exposed billions of data points.
For the first quarter of 2019, the number of both data breaches and data exposures was up significantly compared to 2018. Troia, who runs the threat intelligence firm Data Viper, says that over the past few years he has been building out a repository of exposed data to use in scanning and tracking. At the end of 2017 he says he was struggling to get 4 billion records into the platform. By March 2018, he had ingested 5 billion. Today he has compiled more than 13 billion. "That’s a huge, massive jump," Troia says.
Just because data is exposed online doesn't mean hackers have accessed it, and often the data involved is simply culled from public records. But in aggregate, these troves can create real risk by enabling identity theft, credential stuffing, and phishing scams. Much of the data also winds up on the dark web, which has seen an explosion of stolen credentials, according to recent research from the Swiss IT security testing and dark-web monitoring firm ImmuniWeb.
In one sense, the overwhelming volume of data circulating on the dark web may create a sort of risk plateau where more volume doesn't necessarily equal more successful scams. Then again, those marketplaces are subject to the same forces of supply and demand as any other, says Harrison Van Riper, a strategy and research analyst at the security firm Digital Shadows. As supply goes up, prices go down, making it cheaper for more criminals to get more fodder. Van Riper notes that while passwords, credit card numbers, and government IDs are the most obviously threatening pieces of information for scammers to have, it's important not to underestimate the significance of all the supporting data that helps build out profiles of consumers.
"Some of the public information that might be gathered into one spot is already out there—if you look at the white pages you had somebody’s phone number and you had somebody’s address—it’s just that it’s a lot easier to get access now and exploit it at a mass scale," he says. "Given the proliferation, just how much data is out there, somebody is going to find a way to exploit even the most mundane items of information."