Even when names and other personal information have been stripped from big data sets, as few as four random pieces of information, such as the date and place of a purchase, may be enough to identify a specific person, according to a study to be published soon in the journal Science.
The magazine explains:
[The scientists] analyzed 3 months of credit card transactions, chronicling the spending of 1.1 million people in 10,000 shops in a single country. … The bank stripped away names, credit card numbers, shop addresses, and even the exact times of the transactions. All that remained were the metadata: amounts spent, shop type—restaurant, gym, or grocery store, for example—and a code representing each person.
But because each individual's spending pattern is unique, the data have a very high “unicity.” … To reveal a person's identity, you just need to correlate the metadata with information about the person from an outside source.
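To make the idea of correlating metadata with an outside source concrete, here is a minimal sketch in Python. It is not the researchers' code: the ledger, the person codes, the day numbers, and the `candidates` function are all hypothetical, invented only to show how each extra known purchase shrinks the pool of matching people.

```python
from collections import defaultdict

# Hypothetical anonymized ledger: (person_code, day, shop_type).
# Names, card numbers, and addresses are already gone.
RECORDS = [
    ("u1", 3, "bakery"),
    ("u1", 4, "restaurant"),
    ("u1", 7, "gym"),
    ("u2", 3, "bakery"),
    ("u2", 4, "grocery"),
    ("u2", 7, "gym"),
    ("u3", 3, "bakery"),
    ("u3", 4, "restaurant"),
    ("u3", 8, "grocery"),
]


def candidates(records, known_points):
    """Return the person codes whose purchases include every known point.

    known_points stands in for outside information about the target,
    for example (day, shop_type) pairs gleaned from receipts or photos.
    """
    traces = defaultdict(set)
    for code, day, shop in records:
        traces[code].add((day, shop))
    wanted = set(known_points)
    # A code stays a candidate only if its trace contains all known points.
    return {code for code, trace in traces.items() if wanted <= trace}


# Each additional outside observation narrows the candidate set.
one = candidates(RECORDS, [(3, "bakery")])
two = candidates(RECORDS, [(3, "bakery"), (4, "restaurant")])
three = candidates(RECORDS, [(3, "bakery"), (4, "restaurant"), (7, "gym")])
```

In this toy ledger, one known purchase fits all three people, two fit a pair, and three leave a single code. Once only one code remains, every other transaction under that code belongs to the person the outside information describes.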
The study's more technical abstract puts it this way:
Large-scale data sets of human behavior have the potential to fundamentally transform the way we fight diseases, design cities, or perform research. Metadata, however, contain sensitive information. Understanding the privacy of these data sets is key to their broad use and, ultimately, their impact. We study 3 months of credit card records for 1.1 million people and show that four spatiotemporal points are enough to uniquely reidentify 90% of individuals. We show that knowing the price of a transaction increases the risk of reidentification by 22%, on average. Finally, we show that even data sets that provide coarse information at any or all of the dimensions provide little anonymity and that women are more reidentifiable than men in credit card metadata.
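The kind of measurement the abstract describes can be illustrated with a rough sketch. The Python below estimates, for a toy data set, the share of people who are singled out by a given number of points drawn from their own records, with and without the price. It is a simplification under stated assumptions, not the study's procedure: the `TRACES` data, the `project` and `unicity` helpers, and the choice to draw points uniformly at random are all hypothetical.

```python
import random

# Hypothetical traces: person_code -> set of (day, shop_type, amount).
TRACES = {
    "u1": {(1, "bakery", 5), (2, "gym", 30)},
    "u2": {(1, "bakery", 8), (2, "gym", 30)},
    "u3": {(1, "bakery", 5), (3, "grocery", 42)},
}


def project(trace, with_price):
    """Keep or drop the amount, mimicking a coarser released data set."""
    return {pt if with_price else pt[:2] for pt in trace}


def unicity(traces, p, with_price=True, seed=0):
    """Estimate the fraction of people uniquely matched by p of their points."""
    rng = random.Random(seed)
    views = {code: project(t, with_price) for code, t in traces.items()}
    unique = eligible = 0
    for code, view in views.items():
        if len(view) < p:
            continue  # not enough points to draw from
        eligible += 1
        # Draw p points from this person's own trace (sorted for repeatability).
        known = set(rng.sample(sorted(view), p))
        # Count how many traces in the data set contain all drawn points.
        matches = sum(1 for other in views.values() if known <= other)
        unique += matches == 1
    return unique / eligible if eligible else 0.0


without_price = unicity(TRACES, p=2, with_price=False)
with_price = unicity(TRACES, p=2, with_price=True)
```

In this toy data, two people share the same days and shop types, so only one of three is unique when amounts are hidden. Adding the amounts separates all three. The numbers say nothing about real data; they only show the direction of the effect the abstract reports when the price is known.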