“I will say, that a man must be a d—d fool, who can’t spell a word more than one way.” -Judge Nyrum Reynolds, August 31, 1855
Sometimes prescience knows no bounds. Shortly after Alexander Graham Bell founded the American Telephone and Telegraph Company, a Wyoming County, New York, judge (and neither Andrew Jackson nor Mark Twain as held by popular culture) predicted a major obstacle of implementing telephone directory assistance, a service that would not be introduced for another seventy-five years. The problem? Based on how it sounds, how can one lookup a person’s name, especially if the name’s spelling is unknown?
Fortunately Robert C. Russel and Margaret King answered this question with twelve years to spare by patenting Soundex coding, a way of phonetically representing the sound of a word with the initial letter and digits. Soundex is implemented as an SQL function call in most relational database systems.
At this point, one might wonder how this discussion, while intriguing, is in any way related to database security? Well, in our last post we discussed using partial protection to avoid performance-killing re-identification (decryption) in some use cases. All of which involved digit strings, such as national identity and payment card numbers.
Another method for eliminating re-identification is to use alternate column encoding schemes. Soundex is a great example: protect a person’s name with encryption, full stop. For wildcard searching, however, introduce a new column with the Soundex representation of that name. One ALTER TABLE followed by one UPDATE statement and we’re done! Even better, Soundex is a hashing algorithm, a “one way” function with no mathematical inverse to compute the original input name.
Soundex is a loss-full algorithm, meaning that the fidelity of the original data is not retained in its alternate representation. Loss-full algorithms are quite useful for database security applications when only some fidelity is needed to perform a query while lossless fidelity would either be inefficient or breach security. This is where alternate column encodings shine: the original column is sensitive and a loss-full alternate is useful.
Birth dates are among the most sensitive types of Personally Identifiable Information (PII) stored in a database. These are essential to perform fraud, especially identity theft. Yet understanding age is necessary for marketing and sales analysis: what ages of people buy our products? And just as important, what ages of people do not?
Since some business applications have a legitimate need for a birth date, we often store those in protected form. If we want to avoid re-identification in a query, we can introduce a loss-full alternative encoding: a person’s birth year. Or perhaps year and quarter for more detailed analytics. That information should be sufficient for most analytics, including medical research.
On this last point, recently the University of Chicago and Google were sued for patient privacy violations because the shared data included times and dates of treatment as well as plain text notes. The plaintiffs allege this allows unique identification of themselves. Had those notes been protected and those dates abstracted, perhaps the defendants would be in a better position. It behooves us as database security professionals to consider loss-full alternate encodings to avoid similar situations befalling us.
Finally, let’s consider the question if introducing alternative encodings violates Third Normal Form (3NF), which we’ve said is necessary to promote database security, as Second Normal Form (2NF) is violated when introducing a functional dependency between columns. I would argue no for the following reason: the source column is encrypted and the alternate encoding cannot be derived without access to a decryption key stored outside the database. If you have another view on this topic, or anything else in this entire discussion, please post your thoughts in the comments below.
That’s it for alternative coding schemes. Next time, we’ll discuss how to perform discrete range searches on encrypted data.
 Per Garson. “No Respect for a Man Who Can Spell a Word Only One Way.” Quote Investigator, 12 Mar. 2019, quoteinvestigator.com/2010/06/25/spelling/#more-760.
Data security and encryption