A recent paper, “Estimating the success of re-identifications in incomplete datasets using generative models,” discusses new techniques to re-identify anonymized data. Data re-identification means taking data that has been made anonymous—protected using non-reversible techniques—and figuring out the original values. The paper notes:
Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.
This is interesting stuff, and it applies when weighing the security of a data protection approach. One important criterion is consistency: whether a given piece of data is protected to the same value every time.
When data is protected consistently, it is typically easier to re-identify. A simple example is a county where there are only a handful of households. By correlating something as simple as the number of residents in each, one might identify which set of data belongs to which household. And if, say, birth date is not protected (common, as people use it frequently as a reference point), someone over 100 years old will be instantly identifiable, since there won’t be more than one of them in that ZIP code. Using consistent data protection is very common, especially as the value of analytics grows, because it allows correlation and analysis using just the ciphertext.
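To make the correlation point concrete, here's a minimal Python sketch of consistent protection via keyed hashing. Everything in it is made up for illustration (the key, the names, the household counts), but it shows the core property: the same cleartext always yields the same token, so two "anonymized" datasets can be joined on tokens alone, and a unique attribute then pins a record to a real household.

```python
import hmac, hashlib

KEY = b"analytics-key"  # hypothetical protection key shared across datasets

def protect(value: str) -> str:
    """Consistent, non-reversible protection: same input -> same token."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:12]

# Two "anonymized" datasets from the same small county.
voter_rolls = {protect(name): residents
               for name, residents in [("Alice", 1), ("Bob", 4), ("Cara", 7)]}
survey = {protect(name): "diabetic" for name in ["Cara"]}

# Tokens match across datasets, so records join without ever recovering
# a name -- and the only seven-person household is instantly identifiable.
for token in survey:
    print(token in voter_rolls, voter_rolls[token])  # True 7
```

This is exactly the trade the post describes: the consistency that makes analytics on ciphertext possible is the same property an attacker exploits for correlation.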
If each instance of the protected data is unique, then it is much harder to re-identify. (Note that it is still not necessarily impossible: there may be a distinguishable subset that’s always identifiable, especially with very small domains. Perhaps companies should start seeding data sets with bogus records to avoid this?) However, such protection destroys the ability to do analytics on the data without converting it back to cleartext.
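A sketch of the opposite extreme, again purely illustrative: salting each protection operation with fresh randomness means the same value never produces the same token twice, so equality joins (and hence most ciphertext analytics) stop working.

```python
import os, hmac, hashlib

def protect_randomized(value: str) -> str:
    """Per-instance protection: a fresh salt per record, so the same
    cleartext yields a different token every time."""
    salt = os.urandom(16)
    return hmac.new(salt, value.encode(), hashlib.sha256).hexdigest()[:12]

a = protect_randomized("Cara")
b = protect_randomized("Cara")
print(a == b)  # False (with overwhelming probability): no correlation
```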
And there are gradations between the two extremes. For example, health information might be “binned”—semi-anonymized—by month/year/decade, or by state, allowing limited analytics, perhaps enabling identification of disease clusters, without risking re-identification of individuals.
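A quick sketch of what binning might look like in practice (the function and granularity names are my own, not from any standard): the exact birth date disappears, but coarse buckets remain joinable, so limited analytics like clustering by decade still work.

```python
from datetime import date

def bin_birthdate(d: date, granularity: str) -> str:
    """Generalize a birth date to a coarser bucket."""
    if granularity == "month":
        return f"{d.year}-{d.month:02d}"
    if granularity == "year":
        return str(d.year)
    if granularity == "decade":
        return f"{d.year // 10 * 10}s"
    raise ValueError(f"unknown granularity: {granularity}")

b = date(1917, 7, 22)
print(bin_birthdate(b, "month"))   # 1917-07
print(bin_birthdate(b, "decade"))  # 1910s
```

Note that binning only helps if the buckets are big enough: a "1910s" bucket with one occupant is as identifying as the raw date, which is the centenarian problem from the earlier example.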
Complex re-identification examples often resemble the kind of logic puzzle we grew up with, sometimes called Zebra or Einstein’s puzzles, although I grew up calling them “The Norwegian lives in the blue house” puzzles.
In any case, a lot of this can be made moot. If you protect data properly—that is, you protect enough of the data—then locating a given record usually doesn’t matter. It might if the database is the federal Witness Protection list, or if you’ve got a copy of the client roster of a Swiss bank, but otherwise? The key is to protect enough that even identifying a record doesn’t help an attacker.
Here are my name, address, ZIP code, SSN, birthdate, and Visa number and expiration, protected by format-preserving encryption (FPE):
Khxt Igjqk LMI 26952 Msheudnpue Zw, Qtygyit, TK 17358-1198
826-62-5193 22/07/2015 5498-3332-8253-3709 04/03
Have fun with it. Does knowing that it’s my record do you any good?
I’m also a customer of a large multinational bank. Nobody is going to break into that bank’s system, steal their customer database, and do that analysis just so they can send me spam: they’ll just shotgun the spam at everyone, and if I fall for it because I am that bank’s customer, well, yay. Presumably the bank has enough fields protected that having the database and figuring out which record is mine doesn’t get them very far.
Like so many crypto vulnerabilities, this is a matter of degree. Yes, it's surprisingly easy to re-identify some data. Yes, there are cases where that could matter. Is it easy to do against a properly protected dataset? No. Does it matter most of the time even if you pull it off? No. Are there cheaper and easier ways to attack folks' privacy and accounts? Yes. So I sure don't lose sleep over it.
Finally, as to the claim that such data protection is “unlikely to satisfy the modern standards for anonymization set forth by GDPR”: this is a legal question more than a technical one. And GDPR has to deal with reality. Companies aren’t going to say “Oh, ok, we won’t collect data”, or “Oh, ok, we will destroy the data so we can’t do analytics on it.”
Note also that the comment mentions “anonymisation”, as opposed to “pseudonymisation”. The A-word means “not reversible” in GDPR-speak, so it’s not encryption: typically it means some form of hashing, such as Voltage SecureData Format-Preserving Hash (FPH), which produces consistent, non-reversible protected values for each cleartext. (In the GDPR lexicon, pseudonymisation is encryption or equivalent reversible data protection.) The practicality of anonymisation stronger than something like Format-Preserving Hash (strong enough, that is, to stand up to the kind of scrutiny the paper discusses) is questionable, because companies need to be able to correlate data values, and true anonymisation means they cannot.
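To illustrate the properties being described (and emphatically not Voltage's actual FPH algorithm), here's a toy format-preserving hash built from HMAC: consistent, non-reversible, and the output keeps the input's shape, here a 9-digit SSN.

```python
import hmac, hashlib

KEY = b"fph-demo-key"  # hypothetical; a real deployment would manage keys properly

def fph_ssn(ssn: str) -> str:
    """Toy format-preserving hash: maps an SSN to another SSN-shaped
    string, deterministically and one-way."""
    digits = ssn.replace("-", "")
    digest = hmac.new(KEY, digits.encode(), hashlib.sha256).digest()
    n = int.from_bytes(digest, "big") % 10**9  # reduce to 9 digits
    out = f"{n:09d}"
    return f"{out[:3]}-{out[3:5]}-{out[5:]}"

# Consistency is the point: equality tests and joins on the hash still
# work, which enables analytics -- and, as the paper argues, also makes
# re-identification easier than with a randomized scheme.
print(fph_ssn("826-62-5193") == fph_ssn("826-62-5193"))  # True
```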
All of this leaves me skeptical that it will be possible to meet the GDPR specifications and continue to do business. At a U.S. Chamber of Commerce forum a couple of weeks ago, the CEO of Allstate said, “We are a data analytics company that happens to sell insurance”. That’s a pretty bold statement. They’re going to be very motivated to fight any attempts to make them destroy their business.
One might consider “consistency” vs. “anonymity” as a knob that you turn: the more consistent the data, the less anonymous. At one end is perfect anonymity; at the other, perfect consistency and, with it, full linkability. The question is what point on the dial is sufficient for compliance with HIPAA, GDPR, et al. The answer to this is going to evolve, and I can’t see the EU simply saying, “Thou shalt protect with strong, one-way protection that destroys your analytics”; if they do, the lawsuits could start in earnest!