Differential privacy: the Laplace mechanism
We are now going to introduce the Laplace mechanism (see The Algorithmic Foundations of Differential Privacy, Section 3.3).
Index
1) Definition of the Laplacian mechanism
Let be the set of possible rows in a database so that we can represent a particular database using a histogram .
Given a particular function , we define the Laplace mechanism as the collection of conditional probabilities given by
where . We can show that this mechanism is -differentially private with and with the sensitivity defined as
Notice that the Laplace mechanism is a randomization algorithm that depends on the function , which can be regarded as a numeric query. In order to apply this mechanism for a particular value of , we need to compute , which might be hard to do, in practice. Furthermore, there are cases in which is unbounded, so there is no way to make finite. Thus, for practical purposes, we may find a compromise and, instead of using the worst-case-scenario , we can use a "typical" sensitivity. This typical sensitivity can be estimated by taking the databases to be random variables, which allows the distribution of to be computed.
2) Example: mean height
We consider the case in which we want to estimate the mean height of a group of people. However, the people do not want to reveal their exact height, so some Laplace noise is introduced. The first thing we need to do is decide what amount of noise that should added to their true height.
In this case, the database consists simply of one record, height, and the function returns the height. Now, since the height is not bounded, the sensitivity is infinite. However, in practice, humans are typically not more than 2m tall and are never more than 3m tall. So this is a case in which the use of a typical sensitivity is in order.
The amount of noise we should introduce depends on the typical value of the height of the general population. For example, adding noise of about 0.1cm is clearly too little. On the other hand, adding noise of about 1m is too much, in the sense that it would be useless to compute an average height.
The distribution of heights of adult males is normal with mean and . Similarly, for females, the distribution is normal with and . Assuming that there are the same number of males and females, the height distribution would be
The mean of the distribution of heights in the general population is and the variance is given by
2.1) General distribution
Let's calculate the height distribution of the population:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
mu_M = 178
mu_F = 162
sigma_M = 7
sigma_F = 6.5
mu = 1/2 * (mu_M + mu_F)
sigma = np.sqrt(1/2 * (sigma_M**2 + sigma_F**2) + 1/4 * (mu_M - mu_F)**2)
x = np.arange(mu - sigma, mu + sigma, 0.001) # range of x in spec
x_all = np.arange(0, 220, 0.001) # entire range of x, both in and out of spec
y = 1/2 * (norm.pdf(x, mu_M, sigma_M) + norm.pdf(x, mu_F, sigma_F))
y_M = norm.pdf(x_all, mu_M, sigma_M)
y_F = norm.pdf(x_all, mu_F, sigma_F)
y_T = 1/2 * (y_M + y_F)
Visually, the plot shows the height distribution of the populations:
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize=(9,6))
ax.plot(x_all, y_M, label="Male")
ax.plot(x_all, y_F, label="Female")
ax.plot(x_all, y_T, label="General")
ax.fill_between(x, y, 0, alpha=0.3, color='y', label="$\mu\pm\sigma$")
ax.fill_between(x_all, y_M, 0, alpha=0.1)
ax.fill_between(x_all, y_F, 0, alpha=0.1)
ax.fill_between(x_all, y_T, 0, alpha=0.1)
ax.set_xlim([100, 220])
ax.set_xlabel('height (cm)')
ax.set_yticklabels([])
ax.set_title('Height Distribution')
plt.legend(loc="upper left")
plt.show()
3) Estimating the sensitivity
In order to estimate the typical value of the sensitivity, we can compute its expected value . This gives us
with . In principle, we could use other statistics, such as or some high percentile.
from scipy.special import erf
import math
m = (mu_M - mu_F)/np.sqrt((2*(sigma_M**2 + sigma_F**2)))
df = 1/(2*np.sqrt(math.pi)) * (sigma_M + sigma_F + np.sqrt(2*(sigma_M**2 + sigma_F**2))
* (np.exp(-m**2) + np.sqrt(math.pi)*m*erf(m) ) )
print("Expected sensitivity: " + str(round(df,2)) + " cm")
Expected sensitivity: 11.99 cm
Notice that, in general, it is not possible to compute the expected value (or any other statistic) of exactly. However, the statistics can be estimated using a sampling procedure (see Rubinstein 2017). That is, instead of computing the global sensitivity analytically, we compute an empirical estimation of it by sampling over the dataset. The framework provides a method for sampling the sensitivity.
from shfl.private.utils import unprotected_query
from shfl.differential_privacy.probability_distribution import GaussianMixture
from shfl.differential_privacy.sensitivity_sampler import SensitivitySampler
from shfl.differential_privacy.norm import L1SensitivityNorm
norm_params = np.array([[mu_M, sigma_M],
[mu_F, sigma_F]])
weights = np.ones(2) / 2.0
distribution = GaussianMixture(norm_params, weights)
sampler = SensitivitySampler()
sensitivity, mean = sampler.sample_sensitivity(unprotected_query,
L1SensitivityNorm(),
distribution,
n_data_size=1,
gamma=0.05)
print("Sensitivity from sampling: " + str(round(sensitivity,2)))
print("Mean sensitivity from sampling: " + str(round(mean,2)))
2022-03-23 09:11:16.099876: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-03-23 09:11:16.099891: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Sensitivity from sampling: 47.38
Mean sensitivity from sampling: 11.89
As we can see, the value obtained is approximately the expected value for the sensitivity.
For smaller values of , the amount of privacy is higher, but the utility is lower. For example, if you wish to compute the mean height over a given population, the accuracy of the result would degrade as the amount of privacy increases. This is shown in the experiment below.
4) Experiments using different sensitivities
In particular, we consider a population of 500 people with a mean height . However, these people reported their height with a Laplace noise controlled by and the sensitivity , which is then used to estimate the mean height. As decreases, the error in this estimate increases.
from shfl.differential_privacy.mechanism import LaplaceMechanism
from shfl.private.federated_operation import federate_array
def experiment(sensitivity, epsilon):
true_heights = np.random.normal(170, 1, 500)
true_mean = np.mean(true_heights)
federated_array = federate_array(true_heights, len(true_heights))
# Define allowed access to data.
data_access_definition = LaplaceMechanism(sensitivity, epsilon)
federated_array.configure_data_access(data_access_definition)
# Query data
result = federated_array.query()
observed_mean = np.mean(result)
error = np.abs(true_mean - observed_mean)
return observed_mean, error
def run_n_experiments(sensitivity, epsilon, n_runs):
for run in range(n_runs):
observed_mean, error = experiment(sensitivity, epsilon)
errors[i_df, i_epsilon] = errors[i_df, i_epsilon] + error
return 1/n_runs * errors[i_df, i_epsilon]
With the experiments defined, let's run them to obtain the results regarding different sensitivities:
# Run the experiments:
epsilon_range = np.arange(0.001, 20, 0.05)
df_range = df + [0, sigma, 2*sigma]
errors = np.zeros((len(df_range), len(epsilon_range)), dtype=object)
n_runs = 1
for i_df in range(len(df_range)):
for i_epsilon in range(len(epsilon_range)):
errors[i_df, i_epsilon] = run_n_experiments(df_range[i_df], epsilon_range[i_epsilon], n_runs)
The only thing left now is to visualize these results:
# Plot the figure:
fig, ax = plt.subplots(figsize=(9,6))
labels = ["$\mathbb{E}[\Delta f]$", "$\mathbb{E}[\Delta f] + \sigma$", "$\mathbb{E}[\Delta f] + 2\sigma$"]
for i_df in range(len(df_range)):
ax.plot(epsilon_range, errors[i_df], label = labels[i_df])
ax.set_xlabel('$\epsilon$')
ax.set_ylabel('Error (cm)')
ax.set_yscale('log')
plt.legend(title = "Sensitivity", loc="upper right")
plt.show()
The error is smaller as privacy decreases (i.e. for greater values of ). Moreover, for increasing values of the expected sensitivity , the privacy increases and thus the error grows.