Differential privacy: basic concepts
There is a common situation in sociology where a researcher wants to develop a study about a certain property of a population. The point is that this property is very sensitive and must remain private because it could be embarrassing or illegal. However, the researcher can gain insight into the population, without compromising their privacy. There is a simple mechanism in which every person follows a randomization algorithm that provides "plausible deniability".
Index
1) Definition of the Randomize Response algorithm
Consider the following algorithm (see The Algorithmic Foundations of Differential Privacy, Section 2.3):
- Flip a coin;
- If tails, then respond truthfully;
- If heads, then flip a second coin and respond "Yes" if heads and "No" if tails.
Even if the user answers "Yes", they could argue that this was due to the randomization algorithm and that it is not a true answer.
Now, imagine that the researcher wants to use the previous algorithm to estimate the average number of people that have broken a particular law. The interesting point here is that they can obtain a very good approximation, while maintaining the privacy of every individual. Let's look at a specific example to get more familiar with the situation.
First, we are going to generate a binary random vector with a concrete mean given by .
import numpy as np
import matplotlib.pyplot as plt
import shfl
p = 0.2
data_size = 10000
array = np.random.binomial(1,p,size=data_size)
np.mean(array)
2022-03-24 12:18:05.487010: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-03-24 12:18:05.487034: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
/home/f.palomino/Desktop/venvpruebas/environment/lib/python3.8/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
0.2061
We now show the first 100 elements:
array[0:100]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
Let's suppose that every number represents whether the person has broken the law (value=1) or not (value=0). Now, we are going to generate federated data from this array. Every node is assigned one value, simulating, for instance, a piece of information on their mobile device.
from math import log, exp
federated_array = shfl.private.federated_operation.federate_array(array, data_size)
Now, we want every node to execute the previously defined algorithm and to return the result.
from shfl.differential_privacy.mechanism import RandomizedResponseCoins
# Define allowed access to data.
data_access_definition = RandomizedResponseCoins()
federated_array.configure_data_access(data_access_definition)
# Query data
result = federated_array.query()
Let's compare the mean of the original vector with the mean of the returned vector.
print("Generated binary vector with mean: " + str(np.mean(array)))
print("Differential query mean result: " + str(np.mean(result)))
Generated binary vector with mean: 0.2061
Differential query mean result: 0.3492
What happened? Obviously, we have modified the true mean of the vector applying the randomization algorithm. We need to remove the influence of the latter to get the correct result. Fortunately, we can reverse the process to get a good estimate.
We are introducing noise, half of the time, with mean 0.5. So, the expected value will be:
Expected estimated mean = actual mean * 0.5 + random mean * 0.5
np.mean(array) * 0.5 + 0.5*0.5
0.35305
Pretty close, isn't it? To get the corrected estimated mean value, we just need to use the following expression:
Corrected estimated mean = (estimated mean - random mean * 0.5) / 0.5
(np.mean(result) - 0.5*0.5)/0.5
0.19840000000000002
Right! We have obtained an estimate that is very similar to the true mean. However, introducing randomness comes at a cost. In the following sections, we are going to formalize the error introduced in the estimation by the differential privacy mechanism and study the effect of the population size and the privacy on the algorithm's performance.
2) Definition of differential privacy
We are now going to introduce the notion of differential privacy (see The Algorithmic Foundations of Differential Privacy, Definition 2.4).
Let be the set of possible rows in a database, so that we can represent a particular database using a histogram . A randomized algorithm is a collection of conditional probabilities for all and , where is the response space.
Differential privacy is a property of some randomized algorithms. In particular, a randomized algorithm is -differentially private if
for all and for all databases such that .
In words, this definition means that for any pair of similar databases (i.e., differing in one element only), the probability of getting the same result after randomization is similar (i.e., probabilistically bounded).
3) Digging into the randomized response
Let's get back to sociological studies.
Suppose a group of people take part in a study that wants to estimate the proportion of the population that commits fraud when paying their taxes. Since this is a crime, the participants might be worried about the consequences of telling the truth, so they are told to follow the algorithm described previously.
This procedure is, in fact, an -differentially private randomized mechanism. In particular, given this differential privacy mechanism, we have that
and a direct computation shows that .
The probability of responding "Yes" is given by
where is the quantity of interest in the study. Using that is estimated to be , we have an estimate for the proportion of the population that commits fraud, namely
3.1) Trade-off between accuracy and privacy
Now, we are going to create a set of useful functions to execute some experiments and try to understand better the behavior of the previous expressions. The framework provides an implementation of this algorithm using two parameters, namely the probabilities of getting "heads" in each of the two coin tosses. For simplicity, we are going to consider the case in which the first coin is biased but the second one remains fair.
More explicitly, we consider the conditional probabilities given by
with . This corresponds to taking the first coin with and the second coin to be unbiased. In this case, the amount of privacy depends on the bias of the first coin,
For , we have that and the algorithm is maximally private. On the other hand, for , tends to infinity, so the algorithm is not private at all.
The relationship between the number of positive responses and is given by
In order to properly see the trade-off between utility and privacy, we may look at the uncertainty of the estimate, which is given by
This expression shows that, as the privacy increases ( approaches ), the uncertainty of the estimate increases. On the other hand, for , the uncertainty of the estimate is purely due to the finite size of the sample . In general, we see that the price we pay for being private is in terms of the accuracy of the estimate (for a fixed sample size). The expression for the uncertainty is computed using the normal approximation to the error of a binomial distribution.
def get_prob_from_epsilon(epsilon):
f = np.exp(epsilon) / (1 + np.exp(epsilon))
return 2*(1-f)
We also define the uncertainty function.
def uncertainty(dp_mean, n, epsilon):
f = np.exp(epsilon) / (1 + np.exp(epsilon))
estimation_uncertainty = 2/(2*f-1) * np.sqrt(dp_mean*(1-dp_mean) / n)
return estimation_uncertainty
Finally, we define a function to execute the experiments times so that we can study the average behavior.
def experiment(epsilon, p, size):
array = np.random.binomial(1,p,size=size)
federated_array = shfl.private.federated_operation.federate_array(array, size)
prob = get_prob_from_epsilon(epsilon)
# Define allowed access to data.
data_access_definition = RandomizedResponseCoins(prob_head_first=prob)
federated_array.configure_data_access(data_access_definition)
# Query data
result = federated_array.query()
estimated_mean = (np.mean(result) - prob * 0.5)/(1-prob)
estimation_uncertainty = uncertainty(np.mean(result), size, epsilon)
return estimated_mean, estimation_uncertainty
def run_n_experiments(epsilon, p, size, n_runs):
uncertainties = 0
p_est = 0
for i in range(n_runs):
estimated_mean, uncertainty = experiment(epsilon,p,size)
p_est = p_est + estimated_mean
uncertainties = uncertainties + uncertainty
p_est = p_est / n_runs
uncertainties = uncertainties / n_runs
return p_est, uncertainties
3.2) Results
Now, we are going to execute the experiment and save the results.
epsilon_range = np.arange(0.001, 10, 0.1)
n_range = [100, 500, 2000]
p_est = np.zeros((len(n_range), len(epsilon_range)))
uncertainties = np.zeros((len(n_range), len(epsilon_range)))
for i_n in range(len(n_range)):
for i_e in range(len(epsilon_range)):
p_est_i, uncertainty_i = run_n_experiments(epsilon_range[i_e], p, n_range[i_n], 10)
p_est[i_n, i_e] = p_est_i
uncertainties[i_n, i_e] = uncertainty_i
We can now see the results.
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16,6))
for i_n in range(len(n_range)):
ax[0].plot(epsilon_range, p_est[i_n], label = "N = " + str(n_range[i_n]))
ax[0].set_xlabel('$\epsilon$')
ax[0].set_ylabel('$\hat p$')
ax[0].set_xlim([0., 10])
ax[0].set_ylim([0, 1])
for i_n in range(len(n_range)):
ax[1].plot(epsilon_range, uncertainties[i_n], label = "N = " + str(n_range[i_n]))
ax[1].set_xlabel('$\epsilon$')
ax[1].set_ylabel('$\Delta p$')
ax[1].set_xlim([0., 10])
ax[1].set_ylim([0.01, 100])
ax[1].set_yscale('log')
plt.legend(title = "Number of samples", loc="upper right")
plt.show()
As we can see on the graphic of the left, the accuracy of the estimated proportion increases as privacy decreases, i.e. with higher values of . Bearing in mind that the true proportion is , when we are working with low values of , (in this case, around 0 and 2) the prediction is far from the true value. On the other hand, when we work with high values of , meaning less privacy, the estimated proportion is near the real value. Moreover, larger sample sizes are associated to higher accuracy of the estimated proportion as it can be seen when values are low, thus with high privacy.
As for the graphic of the right, we can see that for a fixed sample size , the uncertainty in the estimate decreases as privacy decreases, i.e. with higher values. In any case, never reaches the value 0, highlighting that some uncertainty is always going to be present. Moreover, larger sample sizes are associated to lower uncertainty . This is logical, since if you have less data, changing one element of your dataset will affect more to the mean value of that variable that if you have more data.