Please note that these tutorials refer to a deprecated version of the platform. The current version of the platform, which is not publicly available, has a more advanced architecture and provides a wider range of functionalities. These tutorials are only for illustrative purposes and showcase a limited number of the platform’s capabilities.

Differential privacy: the Laplace mechanism

We are now going to introduce the Laplace mechanism (see The Algorithmic Foundations of Differential Privacy, Section 3.3).

Index

1) Definition of the Laplace mechanism
2) Example: mean height
3) Estimating the sensitivity
4) Experiments using different sensitivities

1) Definition of the Laplace mechanism

Let $\mathcal X$ be the set of possible rows in a database, so that we can represent a particular database as a histogram $x\in\mathbb N^{|\mathcal X|}$.

Given a particular function $f:\mathbb N^{|\mathcal X|}\rightarrow \mathbb R^k$, we define the Laplace mechanism as the collection of conditional probabilities given by

$$p(z\,|\,f(x); b) = \prod_{i=1}^k \frac{1}{2b}\, e^{-\frac{1}{b}\left| f(x)_i - z_i\right|}$$

where $z\in\mathbb R^k$. We can show that this mechanism is $\epsilon$-differentially private with $\epsilon = \frac{\Delta f}{b}$, where the sensitivity $\Delta f$ is defined as

$$\Delta f = \underset{||x-y||_1 = 1}{\max}\, ||f(x) - f(y)||_1\,.$$

Notice that the Laplace mechanism is a randomization algorithm that depends on the function $f$, which can be regarded as a numeric query (for example, a counting query, for which $\Delta f = 1$). In order to apply this mechanism for a particular value of $\epsilon$, we need to compute $\Delta f$, which might be hard to do in practice. Furthermore, there are cases in which $\Delta f$ is unbounded, so there is no way to make $\epsilon$ finite. Thus, for practical purposes, we may find a compromise and, instead of using the worst-case $\Delta f$, use a "typical" sensitivity. This typical sensitivity can be estimated by taking the databases $x, y$ to be random variables, which allows the distribution of $||f(x) - f(y)||_1$ to be computed.
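
To make the definition concrete, here is a minimal numpy sketch of the mechanism (the function laplace_mechanism is ours, not part of any framework): a query result is perturbed with Laplace noise of scale $b = \Delta f/\epsilon$.

import numpy as np

def laplace_mechanism(query_result, sensitivity, epsilon):
    # Noise scale b = sensitivity / epsilon yields epsilon-differential privacy.
    b = sensitivity / epsilon
    return query_result + np.random.laplace(loc=0.0, scale=b, size=np.shape(query_result))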

2) Example: mean height

We consider the case in which we want to estimate the mean height of a group of people. However, the people do not want to reveal their exact height, so some Laplace noise is introduced. The first thing we need to do is decide how much noise should be added to their true heights.

In this case, each database consists of a single record, the height, and the function $f$ simply returns it. Now, since the height is not bounded, the sensitivity $\Delta f$ is infinite. However, in practice, humans are typically not more than 2 m tall and are never more than 3 m tall. So this is a case in which the use of a typical sensitivity is in order.
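
An alternative way to obtain a finite sensitivity, not pursued in this tutorial, is to clamp the query output to a plausible range; here is a minimal sketch (the bounds are our assumption):

def clamped_height(height, h_min=0.0, h_max=300.0):
    # Clamping bounds the output, so the global sensitivity of this
    # query is at most h_max - h_min = 300 cm.
    return float(np.clip(height, h_min, h_max))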

The amount of noise we should introduce depends on the typical values of height in the general population. For example, adding noise of about 0.1 cm is clearly too little to mask anyone's height. On the other hand, adding noise of about 1 m is too much, in the sense that the resulting average height would be useless.

The distribution of heights of adult males is approximately normal with mean $\mu_M = 178\,{\rm cm}$ and standard deviation $\sigma_M = 7\,{\rm cm}$. Similarly, for adult females, the distribution is approximately normal with $\mu_F = 162\,{\rm cm}$ and $\sigma_F = 6.5\,{\rm cm}$. Assuming equal numbers of males and females, the height distribution is

$$p(h) = \frac{1}{2}\left[ \mathcal N\left(h\,|\,\mu_M, \sigma_M^2\right) + \mathcal N\left(h\,|\,\mu_F, \sigma_F^2\right) \right]\,.$$

The mean of the height distribution of the general population is $\mu = \frac{1}{2}(\mu_M + \mu_F)$; since the variance of an equal-weight mixture is the average of the component variances plus the variance of the component means, the variance is given by

$$\sigma^2 = \frac{1}{2}\left( \sigma_M^2 + \sigma_F^2 \right) + \frac{1}{4}\left( \mu_M - \mu_F \right)^2\,.$$

2.1) General distribution

Let's calculate the height distribution of the population:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mu_M = 178
mu_F = 162
sigma_M = 7
sigma_F = 6.5

mu = 1/2 * (mu_M + mu_F)
sigma = np.sqrt(1/2 * (sigma_M**2 + sigma_F**2) + 1/4 * (mu_M - mu_F)**2)
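# With these parameters, mu = 170 cm and sigma = sqrt(109.625) ≈ 10.5 cm.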

x = np.arange(mu - sigma, mu + sigma, 0.001) # heights within one standard deviation of the mean
x_all = np.arange(0, 220, 0.001) # the full range of heights, in and out of that band
y = 1/2 * (norm.pdf(x, mu_M, sigma_M) + norm.pdf(x, mu_F, sigma_F))
y_M = norm.pdf(x_all, mu_M, sigma_M)
y_F = norm.pdf(x_all, mu_F, sigma_F)
y_T = 1/2 * (y_M + y_F)

The plot below shows the height distributions of the male, female, and general populations:

plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize=(9,6))
ax.plot(x_all, y_M, label="Male")
ax.plot(x_all, y_F, label="Female")
ax.plot(x_all, y_T, label="General")

ax.fill_between(x, y, 0, alpha=0.3, color='y', label=r"$\mu\pm\sigma$")
ax.fill_between(x_all, y_M, 0, alpha=0.1)
ax.fill_between(x_all, y_F, 0, alpha=0.1)
ax.fill_between(x_all, y_T, 0, alpha=0.1)
ax.set_xlim([100, 220])
ax.set_xlabel('height (cm)')
ax.set_yticklabels([])
ax.set_title('Height Distribution')
plt.legend(loc="upper left")

plt.show()

[Figure: height distributions for males, females, and the general population, with the $\mu\pm\sigma$ band shaded]

3) Estimating the sensitivity

In order to estimate the typical value of the sensitivity, we can compute its expected value $\mathbb E[\Delta f]$. This gives us

$$\mathbb E[\Delta f] = \frac{1}{2\sqrt{\pi}}\left( \sigma_M + \sigma_F + \sqrt{2(\sigma_M^2 + \sigma_F^2)}\left(e^{-m^2} + \sqrt{\pi}\,m\,{\rm erf}(m)\right) \right)$$

with $m = \frac{\mu_M - \mu_F}{\sqrt{2(\sigma_M^2 + \sigma_F^2)}}$. In principle, we could use other statistics, such as $\mathbb E[\Delta f] + \sigma$ or some high percentile.

from scipy.special import erf

m = (mu_M - mu_F) / np.sqrt(2*(sigma_M**2 + sigma_F**2))
df = 1/(2*np.sqrt(np.pi)) * (sigma_M + sigma_F + np.sqrt(2*(sigma_M**2 + sigma_F**2))
                             * (np.exp(-m**2) + np.sqrt(np.pi)*m*erf(m)))

print("Expected sensitivity: " + str(round(df, 2)) + " cm")
Expected sensitivity: 11.99 cm
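
As a quick sanity check on this formula, we can estimate $\mathbb E[\Delta f]$ directly by drawing pairs of heights from the mixture and averaging their absolute difference (a minimal numpy sketch; the sampling scheme is ours):

rng = np.random.default_rng(0)
n_samples = 100_000

# Draw two independent heights per trial: pick a sex uniformly at random,
# then sample from the corresponding normal distribution.
sex = rng.integers(0, 2, size=(n_samples, 2))
means = np.where(sex == 0, mu_M, mu_F)
sigmas = np.where(sex == 0, sigma_M, sigma_F)
heights = rng.normal(means, sigmas)

# Neighboring databases differ in this single record, so |h1 - h2| is a
# sample of the sensitivity; its mean should come out close to 12 cm.
print(np.mean(np.abs(heights[:, 0] - heights[:, 1])))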

Notice that, in general, it is not possible to compute the expected value (or any other statistic) of $\Delta f$ exactly. However, these statistics can be estimated using a sampling procedure (see Rubinstein 2017). That is, instead of computing the global sensitivity $\Delta f$ analytically, we compute an empirical estimate of it by sampling over the dataset. The framework provides a method for sampling the sensitivity.

from shfl.private.utils import unprotected_query
from shfl.differential_privacy.probability_distribution import GaussianMixture
from shfl.differential_privacy.sensitivity_sampler import SensitivitySampler
from shfl.differential_privacy.norm import L1SensitivityNorm

# Mixture components given as [mean, standard deviation] rows:
norm_params = np.array([[mu_M, sigma_M],
                        [mu_F, sigma_F]])

weights = np.ones(2) / 2.0  # equal weight for males and females
distribution = GaussianMixture(norm_params, weights)

sampler = SensitivitySampler()
sensitivity, mean = sampler.sample_sensitivity(unprotected_query, 
                                               L1SensitivityNorm(), 
                                               distribution, 
                                               n_data_size=1, 
                                               gamma=0.05)
print("Sensitivity from sampling: " + str(round(sensitivity,2)))
print("Mean sensitivity from sampling: " + str(round(mean,2)))
Sensitivity from sampling: 47.38
Mean sensitivity from sampling: 11.89

As we can see, the mean sensitivity obtained from sampling is approximately equal to the analytical expected value, while the sampled sensitivity is larger, since it is a more conservative, high-confidence estimate.

For smaller values of $\epsilon$, the amount of privacy is higher, but the utility is lower. For example, if you wish to compute the mean height over a given population, the accuracy of the result degrades as the amount of privacy increases.
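
For intuition, consider $\Delta f \approx 12$ cm and $\epsilon = 1$ (back-of-the-envelope figures of ours, not outputs of the framework). Each reported height then carries Laplace noise of scale $b = 12$ cm, but averaging over 500 people shrinks the noise in the mean considerably:

b = 12 / 1.0                                       # noise scale b = sensitivity / epsilon
noise_std_person = b * np.sqrt(2)                  # Laplace std per person, about 17 cm
noise_std_mean = noise_std_person / np.sqrt(500)   # std of the mean, about 0.76 cm

The experiments below measure this trade-off directly.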

4) Experiments using different sensitivities

In particular, we consider a population of 500 people whose mean height is true_mean. These people report their heights with Laplace noise controlled by $\epsilon$ and the sensitivity $\Delta f$, and the noisy reports are then used to estimate the mean height. As $\epsilon$ decreases, the error in this estimate increases.

from shfl.differential_privacy.mechanism import LaplaceMechanism
from shfl.private.federated_operation import federate_array

def experiment(sensitivity, epsilon):
    true_heights = np.random.normal(170, 1, 500)
    true_mean = np.mean(true_heights)
    
    federated_array = federate_array(true_heights, len(true_heights))
    
    # Define allowed access to data. 
    data_access_definition = LaplaceMechanism(sensitivity, epsilon)
    federated_array.configure_data_access(data_access_definition)
    
    # Query data
    result = federated_array.query()
    
    observed_mean = np.mean(result)
    error = np.abs(true_mean - observed_mean)
    
    return observed_mean, error

def run_n_experiments(sensitivity, epsilon, n_runs):
    # Average the estimation error over n_runs repetitions.
    total_error = 0
    for run in range(n_runs):
        _, error = experiment(sensitivity, epsilon)
        total_error += error

    return total_error / n_runs

With the experiments defined, let's run them for a range of $\epsilon$ values and three different sensitivities:

# Run the experiments:
epsilon_range = np.arange(0.001, 20, 0.05)
df_range = df + [0, sigma, 2*sigma]  # sensitivities E[Δf], E[Δf] + sigma, E[Δf] + 2*sigma
errors = np.zeros((len(df_range), len(epsilon_range)))

n_runs = 1
for i_df in range(len(df_range)):
    for i_epsilon in range(len(epsilon_range)):
        errors[i_df, i_epsilon] = run_n_experiments(df_range[i_df], epsilon_range[i_epsilon], n_runs)

The only thing left now is to visualize these results:

# Plot the figure:
fig, ax = plt.subplots(figsize=(9,6))
labels = ["$\mathbb{E}[\Delta f]$", "$\mathbb{E}[\Delta f] + \sigma$", "$\mathbb{E}[\Delta f] + 2\sigma$"]
for i_df in range(len(df_range)):
    ax.plot(epsilon_range, errors[i_df], label = labels[i_df])
ax.set_xlabel(r'$\epsilon$')
ax.set_ylabel('Error (cm)')
ax.set_yscale('log')
plt.legend(title = "Sensitivity", loc="upper right")
plt.show()

[Figure: mean-height estimation error vs. $\epsilon$ for the three sensitivities, with a logarithmic error axis]

The error is smaller as privacy decreases (i.e., for greater values of $\epsilon$). Moreover, as the sensitivity used in the mechanism grows from $\mathbb{E}[\Delta f]$ to $\mathbb{E}[\Delta f] + 2\sigma$, more noise is added, so the privacy increases and the error grows.
