A mathematical theory of privacy

In the last decade we have seen exponential growth in the collection of data, and much of this data must be kept private: from personal medical records to the telemetry collected by an industrial pasta dryer. At the same time, the benefits of sharing that data are huge: better scientific research, better decision making and better policy making.

From these conflicting goals arises the question of how to share data while preserving privacy.

The old insecure methods

The most naive approach is anonymization: removing identifiable information, such as names and addresses, from the dataset. Superficially this seems like enough to protect the privacy of the users from whom the data was collected, but in many cases publishing anonymized data is not much better than publishing the raw dataset. The key reason is that we are unique in many ways: the times and stations at which we take the metro, the groceries we buy, the books we read, …

For this reason many anonymized datasets can be, and have been, deanonymized, and the issue is only worsened by the fact that recording thousands of data points per user is now easily achievable. For example, an anonymized browsing-history dataset was successfully deanonymized by Andreas Dewes and Svea Eckert, revealing the medication taken by a specific German judge.

The fact that anonymization is poor at protecting privacy has been known for a long time; that is why many companies and institutions release only summary statistics (e.g. minimums, maximums, averages, …) and keep the source dataset private.

You would be excused for thinking that this surely is enough to protect privacy, but you would be wrong: researchers have shown that if enough summary statistics are revealed, the original data can be reconstructed. This is especially true when the data is categorical, meaning it can take only a few values, for example male/female or a rating in stars.

Confronted with the fact that these common-sense techniques are ultimately flawed, many companies and institutions have simply stopped releasing data for fear of compromising the privacy of their users, but this comes at the heavy cost of stifling research and innovation.

A modern approach to privacy

The first key idea that underpins the modern approach to privacy is adding errors to the data. A very simple application of this concept is the randomized response method, which dates back to the 1960s and has been used, for example, to estimate how common tax evasion is.

The method prescribes that before answering the question "Did you evade any taxes in the last year?" the participant should flip a coin. If the result is heads, they respond truthfully; otherwise they flip the coin again and let it choose the answer, say "yes" on heads and "no" on tails.

This simple algorithm guarantees the participant plausible deniability: even if the reported response is "yes", the participant can reasonably claim that they did not evade taxes, since the answer may have been chosen by the second coin flip. At the same time, the tax evasion rate can still be derived from the responses: if the true rate is p, the expected fraction of "yes" answers is p/2 + 1/4, so p can be recovered from the observed fraction.
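A minimal simulation makes this concrete (a sketch in Python; the 10% true evasion rate and population size are assumed for illustration). Since the expected fraction of "yes" answers is q = p/2 + 1/4, the estimator p = 2q − 1/2 recovers the true rate:

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """One survey response: on heads answer truthfully,
    on tails flip again and report heads="yes", tails="no"."""
    if random.random() < 0.5:      # first flip: heads
        return true_answer
    return random.random() < 0.5   # second flip chooses the answer

def estimate_rate(responses: list[bool]) -> float:
    """The reported "yes" fraction q satisfies q = p/2 + 1/4,
    so the true rate is recovered as p = 2q - 1/2."""
    q = sum(responses) / len(responses)
    return 2 * q - 0.5

random.seed(0)
true_rate = 0.10   # assumed for the example
population = [random.random() < true_rate for _ in range(100_000)]
responses = [randomized_response(answer) for answer in population]
print(estimate_rate(responses))   # lands close to the true 10% rate
```

No individual response reveals whether that person evaded taxes, yet the aggregate estimate converges to the true rate as the number of respondents grows.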

Notice that revealing the responses to the survey is still a privacy loss, because we now know that, with 50% probability, your reported answer is your true answer. We can lower how often participants answer truthfully to increase privacy, but that also makes the overall estimate of the tax evasion rate less accurate.

This shows why we should think of privacy as a spectrum rather than a binary: we can engineer mechanisms that protect privacy up to some probability and tune them to strike the desired balance between privacy and accuracy of the data.
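The tradeoff can be made quantitative with a hypothetical generalization of the scheme (not from the survey literature cited above, just a sketch): let participants answer truthfully with probability t instead of 1/2. The reported "yes" rate becomes q = t·p + (1 − t)/2, and inverting it divides the statistical error by t, so more privacy (smaller t) means a noisier estimate:

```python
import math

def unbias(q: float, t: float) -> float:
    # Invert q = t*p + (1 - t)/2 to recover the true rate p.
    return (q - (1 - t) / 2) / t

def worst_case_stderr(t: float, n: int) -> float:
    # The observed fraction over n responses has variance at most
    # 1/(4n); dividing by t inflates the error of the estimate.
    return 0.5 / (t * math.sqrt(n))

for t in (0.9, 0.5, 0.1):
    print(f"truth probability {t}: error ~ {worst_case_stderr(t, 10_000):.4f}")
```

With 10,000 respondents, dropping the truth probability from 0.5 to 0.1 makes the estimate five times noisier, while raising it to 0.9 sharpens the estimate at the cost of weaker deniability.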

Adding errors to the public data and treating privacy as a spectrum are the fundamental ideas of differential privacy, the rigorous mathematical study of how to reveal data while ensuring that the privacy loss is limited.
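As a concrete illustration, here is a sketch of one of the basic tools from the differential privacy literature, the Laplace mechanism (the toy dataset and the choice of epsilon are assumptions for the example). To release a count with epsilon-differential privacy, it suffices to add noise drawn from a Laplace distribution with scale 1/epsilon, because adding or removing one person changes a count by at most 1:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sample from the Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(data: list[bool], epsilon: float) -> float:
    # A counting query has sensitivity 1: one person's presence
    # changes it by at most 1. Adding Laplace noise of scale
    # 1/epsilon therefore gives epsilon-differential privacy.
    return sum(data) + laplace_noise(1.0 / epsilon)

random.seed(0)
patients = [i % 7 == 0 for i in range(1000)]   # toy data: 143 positives
print(private_count(patients, epsilon=0.5))    # true count + noise of scale 2
```

Smaller epsilon means stronger privacy but noisier answers, the same spectrum seen with randomized response: the analyst picks epsilon to balance the two.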

In the last decade researchers in this field have developed a wide array of methods following these key principles, and they are finally gaining traction outside academia; for example, the latest US census was released with the privacy of respondents guaranteed by differential privacy techniques.

The future of privacy

As of today, the biggest obstacles to widespread adoption of differential privacy are the mathematical complexity of understanding the methods and the technical complexity of implementing the algorithms correctly.

Microsoft and Harvard are collaborating on the OpenDP (Open Differential Privacy) project to solve this problem: they are building tools that automatically apply the correct algorithms to protect private information. In a not-so-distant future, ensuring privacy may become as easy as clicking a button in Excel.