May 14, 2019
By Robin Röhm
Why Differential Privacy?
“Anonymized data” is not anonymous
Data governs the modern world. We live in an era where personal and sensitive data is captured whenever and wherever possible. Concerns about data privacy are frequently met with the statement that the data is anonymized, meaning that any identifying features were removed from it. Anonymization of this kind is the most common practice for reducing privacy risks, and it is used almost universally, even by companies that specialize in data privacy techniques. The reality is that it is shockingly easy to “re-identify” or “de-anonymize” an anonymized dataset. Numerous examples, most prominently the case study of an anonymized Netflix dataset, reveal how little auxiliary information is needed to re-identify a person in an “anonymized” dataset.
Recent advances in machine learning and deep learning intensify this problem. Re-identification attacks are so sophisticated today that the concept of Personally Identifiable Information has no technical meaning anymore: essentially any information can be personally identifying.
Appropriate use of personal information is governed by data privacy laws. According to the European General Data Protection Regulation (GDPR), data is considered personal if someone can use the output of a data release mechanism to find a predicate that matches exactly one person, a notion known as singling out. It has been argued that most anonymization techniques do not withstand sophisticated de-anonymization attacks of this kind and are therefore not compatible with the legal requirements of GDPR (Cohen & Nissim).
Because of these problems, more sophisticated techniques are needed. Fortunately, a newer and very powerful concept of data privacy offers help. It is called Differential Privacy, and it derives its power from a rigorous mathematical foundation.
Differential Privacy (DP)
Since Differential Privacy (DP) has received media attention through its deployment by Google, Apple, Microsoft, the US Census Bureau and others, customers and investors increasingly ask about it. At apheris AI, we have noticed that there are ambiguities and misconceptions about DP, which is why this article explains it in simple terms for a non-technical audience.
What is Differential Privacy?
In simple words, DP promises the following: you, as someone who donates their data, will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other information sources are available. DP provides a mathematical guarantee that the outcome of a data analysis cannot be tied back to any individual’s data in a dataset (Dwork).
To be more precise, suppose you are working with data. Every data analysis then has three components: a data input, the analysis itself, and an output.
The analysis can be pretty much anything, for example:
computing some statistic ("how many patients have a certain disease")
an anonymization strategy ("remove names and last digits of Social Security Number")
a machine learning algorithm ("build a model to predict which patients have a certain disease")
Traditional definitions of privacy have focused on the output (e.g., all data anonymization concepts are based on properties of the output data). In contrast, DP is a property of the analysis. A function is considered differentially private if its output for any two datasets that differ in only one row (e.g., in one patient) is basically the same. So what does "basically the same" mean in this context? It means that the probability distributions of the outputs are similar: you obtain any given output of your analysis with almost the same probability, regardless of whether you use dataset 1 or dataset 2.
In other words: DP mathematically guarantees that anyone examining the output of a differentially private analysis draws essentially the same conclusion whether or not an individual’s private information is included in the input to the analysis (Nissim et al.).
If you think about the above definition, you can see that DP is a very intuitive definition of privacy. Under DP, an algorithm is considered private if individual entries in an input database do not change the output of the analysis significantly. This means that an adversary who has access to the output of the analysis cannot reliably tell whether the data of one specific person was in the input database. Consequently, the output reveals essentially nothing about a specific individual beyond what could be learned without that person's data.
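For readers who want the formal statement behind "basically the same", the standard definition can be written as follows. The symbols are notational assumptions that do not appear elsewhere in this article: M is the (randomized) analysis, D_1 and D_2 are any two datasets that differ in one row, S is any set of possible outputs, and ε is the privacy parameter discussed further below.

```latex
\Pr[\,M(D_1) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D_2) \in S\,]
\quad \text{for all neighboring datasets } D_1, D_2 \text{ and all output sets } S
```

For small ε, the factor e^ε is close to 1 (roughly 1 + ε), so the two probabilities are nearly equal. This is exactly the sense in which the output is "almost independent" of any single person's data.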
Why is Differential Privacy important?
Attack models become obsolete
If you develop a cryptographic protocol that is meant to secure a database or an analysis, you usually make assumptions about what knowledge and capabilities a potential attacker might have. The beautiful thing about DP is that these assumptions become unnecessary. If a function is differentially private, any type of information about an individual is protected, and this holds no matter what the attacker knows about your data. This is particularly powerful because you no longer have to think about the goals and methods of an attacker. It follows from the fact that DP is a property of a function rather than of a data output.
Differentially Private functions have very powerful properties
Quantifiability: you can quantify the greatest possible information gain by the attacker. The corresponding parameter, usually named ε, bounds this gain: the smaller ε is, the stronger the privacy guarantee. It lets you decide which analyses to authorize while being sure that a chosen privacy threshold is not crossed.
Composability: differentially private functions can be executed one after another, and the combination is still differentially private, with the individual privacy losses adding up in the simplest case. This means that you can construct systems where multiple analyses are run and still quantify the level of privacy that the entire system achieves.
In practice, these two properties are usually combined to construct a differentially private system with an overall privacy budget, composed of many individual algorithms that are each differentially private. Such systems provide mathematical guarantees for the privacy protection of the data, and these guarantees are by construction grounded in the system itself.
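To make the idea of a privacy budget concrete, here is a minimal sketch in Python. It assumes the simplest composition rule, in which the ε values of successive analyses add up. The class name `PrivacyBudget` and its methods are hypothetical names chosen for illustration; they do not belong to any particular library or to our own product.

```python
class PrivacyBudget:
    """Tracks cumulative privacy loss under basic (additive) composition.

    Hypothetical helper for illustration only.
    """

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self):
        # Budget still available for future analyses.
        return max(0.0, self.total_epsilon - self.spent)

    def spend(self, epsilon):
        """Reserve `epsilon` for one analysis, or refuse if over budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so that floating-point rounding does not
        # reject a request that exactly exhausts the budget.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.4)   # first analysis
budget.spend(0.4)   # second analysis
print(f"remaining budget: {budget.remaining:.2f}")
# A third analysis with epsilon=0.4 would now be refused.
```

A real system would typically use tighter composition theorems than plain addition, but the principle is the same: every analysis draws on a shared, finite budget.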
There are many more mathematical properties which can be exploited to effectively construct systems that achieve a required level of privacy, but covering these would go far beyond the scope of this introductory article.
How to construct differentially private functions / systems?
In order to construct differentially private analyses, you usually obscure the initial analysis by adding random noise to the statistic.
The beauty of this approach is that there are mathematical proofs showing that adding noise, calibrated to how much a single person's data can change the result, produces a differentially private analysis. Since we add randomness to the analysis, the outcome of a differentially private analysis is not exact, but an approximation. Therefore, the goal must always be to construct an analysis where the added noise protects the privacy of individuals and the results still remain meaningful.
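As an illustration of adding noise to a statistic, the following sketch applies the well-known Laplace mechanism to the counting example from above ("how many patients have a certain disease"). The function names, the toy patient data and the chosen ε are assumptions made for this example. It is a teaching sketch, not a production implementation: among other things, naive floating-point noise sampling is known to be a source of subtle privacy leaks.

```python
import random


def sample_laplace(scale, rng):
    # The difference of two independent exponential variables with
    # mean `scale` follows a Laplace distribution centered at 0.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(rows, predicate, epsilon, rng):
    """Count rows matching `predicate`, plus Laplace noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for row in rows if predicate(row))
    # Adding or removing one person changes a count by at most 1.
    sensitivity = 1.0
    return true_count + sample_laplace(sensitivity / epsilon, rng)


rng = random.Random(42)  # fixed seed only to make the demo repeatable
# Toy dataset (made up): every fourth patient has the disease.
patients = [{"has_disease": i % 4 == 0} for i in range(1000)]

result = noisy_count(patients, lambda p: p["has_disease"], epsilon=0.5, rng=rng)
print(f"noisy count: {result:.1f} (true count is 250)")
```

The released number is close to the true count but not identical to it, and the amount of noise does not depend on how many patients are in the dataset, only on ε and on how much one person can change the count.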
Challenges in constructing differentially private analyses
Generally speaking, there is no generic framework for data privacy. As in cryptographic protocols, security and privacy can be measured by quantifying the likelihood that a certain event happens. As mentioned above, the power of DP lies in the fact that any preexisting knowledge of an attacker is irrelevant, since the output of a differentially private analysis is almost independent of any individual’s presence in the input dataset. Nevertheless, no meaningful system can guarantee full privacy. Differential privacy allows you to quantify the level of privacy that is achieved when running a certain analysis. Specifying this level of privacy not just in mathematical terms, but in ways that match the requirements of a business, is a big challenge.
Another major challenge is finding the right balance between accuracy and privacy of an analysis. For DP, there is a tradeoff between the two: the more noise you add, the more private and the less accurate your analysis becomes. Consequently, it is hugely important to choose the right functions and methods to obscure any output. The goal must always be to guarantee a maximum amount of privacy while minimizing the loss of accuracy. Since differential privacy is a property of the function, building efficient differentially private analyses is a deeply mathematical task that combines the fields of complex analysis, algebra and computational statistics. In most cases, designing and implementing an optimal differentially private algorithm, one that extracts the most accuracy for a task under a fixed privacy budget, is not trivial and requires deep domain expertise.
On top of that, it is easy for implementations to violate DP. Most differentially private algorithms are still developed in academia and haven’t been fully tested in a production environment. In real-world implementations, there are further aspects, such as side-channel attacks, that must be taken into account and are often neglected in a research environment. Furthermore, there are many examples of flawed proofs of DP, so understanding every single step of an algorithm is critical in order to adequately protect the privacy of individuals.
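The tradeoff can be made tangible with a small simulation. The sketch below assumes the Laplace mechanism with a sensitivity of 1 (as for a simple count) and estimates the average absolute error for several values of ε. The function name, the list of ε values and the number of trials are arbitrary choices for illustration.

```python
import random


def mean_abs_laplace_noise(epsilon, sensitivity=1.0, trials=20000, seed=0):
    """Estimate the average absolute error added by Laplace noise."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    total = 0.0
    for _ in range(trials):
        # Difference of two exponentials with mean `scale` is Laplace noise.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        total += abs(noise)
    return total / trials


for eps in (0.1, 0.5, 1.0, 5.0):
    error = mean_abs_laplace_noise(eps)
    print(f"epsilon = {eps}: average error of about {error:.2f}")
```

For this mechanism the expected absolute error equals sensitivity divided by ε, so making the guarantee ten times stronger makes the answer roughly ten times noisier. Whether an error of that size is acceptable depends entirely on the use case, which is why choosing ε is as much a business decision as a mathematical one.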
Our work at apheris AI
Our mission at apheris AI is to empower corporate clients to integrate privacy-preserving computations into their workflows. We specialize in designing and implementing differentially private systems and are building supporting programming frameworks to facilitate their adoption. Our technology enables the construction of efficient and accurate programs that can be deployed in a real-world production environment. We are experts in privacy-preserving technologies and combine this expertise with an understanding of our customers’ needs in terms of usability and practicability of a system. We thereby enable numerous use cases, ranging from privacy-compliant analytics and machine learning workflows to integrating data from distributed, siloed datasets. Feel free to reach out to me if you want to discuss applications of differential privacy.