Differential privacy: basic concepts

There is a common situation in sociology where a researcher wants to develop a study about a certain property of a population. The point is that this property is very sensitive and must remain private because it could be embarrassing or illegal. However, the researcher can gain insight into the population, without compromising their privacy. There is a simple mechanism in which every person follows a randomization algorithm that provides "plausible deniability".

Index

1) Definition of the Randomize Response algorithm

Consider the following algorithm (see The Algorithmic Foundations of Differential Privacy, Section 2.3):

  1. Flip a coin;
  2. If tails, then respond truthfully;
  3. If heads, then flip a second coin and respond "Yes" if heads and "No" if tails.

Even if the user answers "Yes", they could argue that this was due to the randomization algorithm and that it is not a true answer.

Now, imagine that the researcher wants to use the previous algorithm to estimate the average number of people that have broken a particular law. The interesting point here is that they can obtain a very good approximation, while maintaining the privacy of every individual. Let's look at a specific example to get more familiar with the situation.

First, we are going to generate a binary random vector with a concrete mean given by pp.

import numpy as np
import matplotlib.pyplot as plt
import shfl

p = 0.2
data_size = 10000
array = np.random.binomial(1,p,size=data_size)
np.mean(array)
2022-03-24 12:18:05.487010: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-03-24 12:18:05.487034: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
/home/f.palomino/Desktop/venvpruebas/environment/lib/python3.8/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm



0.2061

We now show the first 100 elements:

array[0:100]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Let's suppose that every number represents whether the person has broken the law (value=1) or not (value=0). Now, we are going to generate federated data from this array. Every node is assigned one value, simulating, for instance, a piece of information on their mobile device.

from math import log, exp

federated_array = shfl.private.federated_operation.federate_array(array, data_size)

Now, we want every node to execute the previously defined algorithm and to return the result.

from shfl.differential_privacy.mechanism import RandomizedResponseCoins

# Define allowed access to data. 
data_access_definition = RandomizedResponseCoins()
federated_array.configure_data_access(data_access_definition)

# Query data
result = federated_array.query()

Let's compare the mean of the original vector with the mean of the returned vector.

print("Generated binary vector with mean: " + str(np.mean(array)))
print("Differential query mean result: " + str(np.mean(result)))
Generated binary vector with mean: 0.2061
Differential query mean result: 0.3492

What happened? Obviously, we have modified the true mean of the vector applying the randomization algorithm. We need to remove the influence of the latter to get the correct result. Fortunately, we can reverse the process to get a good estimate.

We are introducing noise, half of the time, with mean 0.5. So, the expected value will be:

Expected estimated mean = actual mean * 0.5 + random mean * 0.5

np.mean(array) * 0.5 + 0.5*0.5
0.35305

Pretty close, isn't it? To get the corrected estimated mean value, we just need to use the following expression:

Corrected estimated mean = (estimated mean - random mean * 0.5) / 0.5

(np.mean(result) - 0.5*0.5)/0.5
0.19840000000000002

Right! We have obtained an estimate that is very similar to the true mean. However, introducing randomness comes at a cost. In the following sections, we are going to formalize the error introduced in the estimation by the differential privacy mechanism and study the effect of the population size and the privacy on the algorithm's performance.

2) Definition of differential privacy

We are now going to introduce the notion of differential privacy (see The Algorithmic Foundations of Differential Privacy, Definition 2.4).

Let X\mathcal X be the set of possible rows in a database, so that we can represent a particular database using a histogram xNXx\in\mathbb N^{|\mathcal X|}. A randomized algorithm is a collection of conditional probabilities P(zx)P(z| x) for all xx and zZz\in \mathcal Z, where Z\mathcal Z is the response space.

Differential privacy is a property of some randomized algorithms. In particular, a randomized algorithm is ϵ\epsilon-differentially private if

P(zx)P(zy)eϵ\frac{P(z|x)}{P(z|y)}\le e^\epsilon

for all zZz\in \mathcal Z and for all databases x,yNXx, y \in\mathbb N^{|\mathcal X|} such that xy1=1||x-y||_1=1.

In words, this definition means that for any pair of similar databases (i.e., differing in one element only), the probability of getting the same result after randomization is similar (i.e., probabilistically bounded).

3) Digging into the randomized response

Let's get back to sociological studies.

Suppose a group of NN people take part in a study that wants to estimate the proportion of the population that commits fraud when paying their taxes. Since this is a crime, the participants might be worried about the consequences of telling the truth, so they are told to follow the algorithm described previously.

This procedure is, in fact, an ϵ\epsilon-differentially private randomized mechanism. In particular, given this differential privacy mechanism, we have that

P(respond yesactual yes)=34P(respond noactual yes)=14P(\text{respond yes} \,|\, \text{actual yes}) = \frac{3}{4}\qquad \qquad P(\text{respond no} \,|\, \text{actual yes}) = \frac{1}{4}
P(respond yesactual no)=14P(respond noactual no)=34,P(\text{respond yes} \,|\, \text{actual no}) = \frac{1}{4}\qquad \qquad P(\text{respond no} \,|\, \text{actual no}) = \frac{3}{4}\,,

and a direct computation shows that ϵ=log(3)\epsilon = \log(3).

The probability of responding "Yes" is given by

P(respond yes)=14+p2P(\text{respond yes}) = \frac{1}{4} + \frac{p}{2}

where p=P(actual yes)p = P(\text{actual yes}) is the quantity of interest in the study. Using that P(respond yes)P(\text{respond yes}) is estimated to be rP=#(respond yes)Nr_P = \frac{\#{\text{(respond yes)}}}{N}, we have an estimate for the proportion of the population that commits fraud, namely

p^=2rP12.\hat p = 2r_P - \frac{1}{2}\,.

3.1) Trade-off between accuracy and privacy

Now, we are going to create a set of useful functions to execute some experiments and try to understand better the behavior of the previous expressions. The framework provides an implementation of this algorithm using two parameters, namely the probabilities of getting "heads" in each of the two coin tosses. For simplicity, we are going to consider the case in which the first coin is biased but the second one remains fair.

More explicitly, we consider the conditional probabilities given by

P(respond yesactual yes)=fP(\text{respond yes} | \text{actual yes}) = f
P(respond yesactual no)=1f.P(\text{respond yes} | \text{actual no}) = 1-f \,.

with f[1/2,1]f\in [1/2,1]. This corresponds to taking the first coin with p(heads)=2(1f)p({\rm{heads}}) = 2(1-f) and the second coin to be unbiased. In this case, the amount of privacy depends on the bias of the first coin,

ϵ=logf1f.\epsilon = \log \frac{f}{1-f}\,.

For f=1/2f=1/2, we have that ϵ=0\epsilon = 0 and the algorithm is maximally private. On the other hand, for f=1f = 1, ϵ\epsilon tends to infinity, so the algorithm is not private at all.

The relationship between the number of positive responses and pp is given by

p^=rP+f12f1.\hat p = \frac{r_P + f - 1}{2f-1}\,.

In order to properly see the trade-off between utility and privacy, we may look at the uncertainty of the estimate, which is given by

Δp=22f1rP(1rP)N.\Delta p = \frac{2}{2f-1}\sqrt{\frac{r_P(1-r_P)}{N}}\,.

This expression shows that, as the privacy increases (ff approaches 1/21/2), the uncertainty of the estimate increases. On the other hand, for f=1f=1, the uncertainty of the estimate is purely due to the finite size of the sample NN. In general, we see that the price we pay for being private is in terms of the accuracy of the estimate (for a fixed sample size). The expression for the uncertainty is computed using the normal approximation to the error of a binomial distribution.

def get_prob_from_epsilon(epsilon):
    f = np.exp(epsilon) / (1 + np.exp(epsilon))
    return 2*(1-f)

We also define the uncertainty function.

def uncertainty(dp_mean, n, epsilon):
    f = np.exp(epsilon) / (1 + np.exp(epsilon))
    estimation_uncertainty = 2/(2*f-1) * np.sqrt(dp_mean*(1-dp_mean) / n)
    return estimation_uncertainty

Finally, we define a function to execute the experiments nrunsn_{\rm{runs}} times so that we can study the average behavior.

def experiment(epsilon, p, size):
    array = np.random.binomial(1,p,size=size)
    federated_array = shfl.private.federated_operation.federate_array(array, size)
    
    prob = get_prob_from_epsilon(epsilon)
    # Define allowed access to data. 
    data_access_definition = RandomizedResponseCoins(prob_head_first=prob)
    federated_array.configure_data_access(data_access_definition)
    # Query data
    result = federated_array.query()
    
    estimated_mean = (np.mean(result) - prob * 0.5)/(1-prob)
    estimation_uncertainty = uncertainty(np.mean(result), size, epsilon)
    return estimated_mean, estimation_uncertainty

def run_n_experiments(epsilon, p, size, n_runs):
    uncertainties = 0
    p_est = 0
    for i in range(n_runs):
        estimated_mean, uncertainty = experiment(epsilon,p,size)
        p_est = p_est + estimated_mean
        uncertainties = uncertainties + uncertainty
    p_est = p_est / n_runs
    uncertainties = uncertainties / n_runs    
    return p_est, uncertainties

3.2) Results

Now, we are going to execute the experiment and save the results.

epsilon_range = np.arange(0.001, 10, 0.1)
n_range = [100, 500, 2000]
p_est = np.zeros((len(n_range), len(epsilon_range)))
uncertainties = np.zeros((len(n_range), len(epsilon_range)))
for i_n in range(len(n_range)):
    for i_e in range(len(epsilon_range)):
        p_est_i, uncertainty_i = run_n_experiments(epsilon_range[i_e], p, n_range[i_n], 10)
        p_est[i_n, i_e] = p_est_i
        uncertainties[i_n, i_e] = uncertainty_i

We can now see the results.

plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16,6))
for i_n in range(len(n_range)):
    ax[0].plot(epsilon_range, p_est[i_n], label = "N = " + str(n_range[i_n]))   
ax[0].set_xlabel('$\epsilon$')
ax[0].set_ylabel('$\hat p$')
ax[0].set_xlim([0., 10])
ax[0].set_ylim([0, 1])
    
for i_n in range(len(n_range)):
    ax[1].plot(epsilon_range, uncertainties[i_n], label = "N = " + str(n_range[i_n]))
ax[1].set_xlabel('$\epsilon$')
ax[1].set_ylabel('$\Delta p$')
ax[1].set_xlim([0., 10])
ax[1].set_ylim([0.01, 100])
ax[1].set_yscale('log')
plt.legend(title = "Number of samples", loc="upper right")
plt.show()

png

As we can see on the graphic of the left, the accuracy of the estimated proportion p^\hat{p} increases as privacy decreases, i.e. with higher values of ϵ\epsilon. Bearing in mind that the true proportion is p=0.2p=0.2, when we are working with low values of ϵ\epsilon, (in this case, around 0 and 2) the prediction is far from the true value. On the other hand, when we work with high values of ϵ\epsilon, meaning less privacy, the estimated proportion is near the real value. Moreover, larger sample sizes NN are associated to higher accuracy of the estimated proportion p^\hat{p} as it can be seen when ϵ\epsilon values are low, thus with high privacy.

As for the graphic of the right, we can see that for a fixed sample size NN, the uncertainty in the estimate Δp\Delta p decreases as privacy decreases, i.e. with higher ϵ\epsilon values. In any case, Δp\Delta p never reaches the value 0, highlighting that some uncertainty is always going to be present. Moreover, larger sample sizes NN are associated to lower uncertainty Δp\Delta p. This is logical, since if you have less data, changing one element of your dataset will affect more to the mean value of that variable that if you have more data.

;