Privacy-Preserving Artificial Intelligence


Laplace Mechanism

We are now going to introduce the Laplace mechanism (see The Algorithmic Foundations of Differential Privacy, Section 3.3). Let $\mathcal{X}$ be the set of possible rows in a database, so that we can represent a particular database using a histogram $x \in \mathbb{N}^{|\mathcal{X}|}$.
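For example, with $\mathcal{X} = \{\text{male}, \text{female}\}$, a database containing three males and two females is represented by the histogram $x = (3, 2)$; a neighboring database obtained by removing one record, e.g. $y = (3, 1)$, satisfies $\|x - y\|_1 = 1$.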

Given a particular function $f:\mathbb{N}^{|\mathcal{X}|} \rightarrow \mathbb{R}^k$, we define the Laplace mechanism as the collection of conditional probabilities given by

$$p(z \mid f(x); b) = \prod_{i=1}^k \frac{1}{2b}\, e^{-\frac{1}{b}\left|f(x)_i - z_i\right|}$$

where $z \in \mathbb{R}^k$. We can show that this mechanism is $\epsilon$-differentially private with $\epsilon = \frac{\Delta f}{b}$ and with the sensitivity $\Delta f$ defined as

$$\Delta f = \max_{\|x - y\|_1 = 1} \|f(x) - f(y)\|_1\,.$$
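To see why, note that for any two neighboring databases $x, y$ (i.e. $\|x - y\|_1 = 1$) and any output $z$, the triangle inequality bounds the ratio of the output densities:

$$\frac{p(z \mid f(x); b)}{p(z \mid f(y); b)} = \prod_{i=1}^k e^{\frac{1}{b}\left(|f(y)_i - z_i| - |f(x)_i - z_i|\right)} \leq \prod_{i=1}^k e^{\frac{1}{b}|f(x)_i - f(y)_i|} = e^{\frac{1}{b}\|f(x) - f(y)\|_1} \leq e^{\frac{\Delta f}{b}}\,,$$

which is exactly the $\epsilon$-differential privacy condition with $\epsilon = \Delta f / b$.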

Notice that the Laplace mechanism is a randomization algorithm that depends on the function $f$, which can be regarded as a numeric query. In order to apply this mechanism for a particular value of $\epsilon$, we need to compute $\Delta f$, which might be hard to do in practice. Furthermore, there are cases in which $\Delta f$ is unbounded, so there is no way to make $\epsilon$ finite. Thus, for practical purposes, we may find a compromise and, instead of using the worst-case $\Delta f$, use a "typical" sensitivity. This typical sensitivity can be estimated by taking the databases $x, y$ to be random variables, which allows the distribution of $\|f(x) - f(y)\|_1$ to be computed.
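As a concrete illustration, here is a minimal sketch of the mechanism for a scalar query in plain NumPy (the function name add_laplace_noise is ours, not part of the framework): the noise is drawn from a Laplace distribution with scale $b = \Delta f / \epsilon$.

import numpy as np

def add_laplace_noise(value, sensitivity, epsilon):
    # Laplace mechanism: a noise scale of b = sensitivity / epsilon
    # yields epsilon-differential privacy.
    b = sensitivity / epsilon
    return value + np.random.laplace(loc=0.0, scale=b, size=np.shape(value))

# Example: a counting query has sensitivity 1, since neighboring
# databases differ in a single record.
print(add_laplace_noise(42, sensitivity=1.0, epsilon=0.5))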

Example: Mean Height

We consider the case in which we want to estimate the mean height of a group of people. However, the people do not want to reveal their exact height, so some Laplace noise is introduced. The first thing we need to do is decide how much noise should be added to their true height.

In this case, the database consists simply of one record, the height, and the function $f$ returns that height. Now, since the height is not bounded, the sensitivity $\Delta f$ is infinite. However, in practice, humans are typically not more than 2 m tall and never more than 3 m tall, so this is a case in which the use of a typical sensitivity is in order.

The amount of noise we should introduce depends on the typical variation of heights in the general population. For example, adding noise of about 0.1 cm is clearly too little to protect anyone's privacy. On the other hand, adding noise of about 1 m is too much, in the sense that the computed average height would be useless.

The distribution of heights of adult males is approximately normal with mean $\mu_M = 178\,\mathrm{cm}$ and standard deviation $\sigma_M = 7\,\mathrm{cm}$. Similarly, for adult females, the distribution is approximately normal with $\mu_F = 162\,\mathrm{cm}$ and $\sigma_F = 6.5\,\mathrm{cm}$. Assuming that there are equal numbers of males and females, the height distribution would be

$$p(h) = \frac{1}{2}\left[\mathcal{N}\left(h \mid \mu_M, \sigma_M^2\right) + \mathcal{N}\left(h \mid \mu_F, \sigma_F^2\right)\right]\,.$$

The mean of the distribution of heights in the general population is $\mu = \frac{1}{2}(\mu_M + \mu_F)$ and the variance is given by

$$\sigma^2 = \frac{1}{2}\left(\sigma_M^2 + \sigma_F^2\right) + \frac{1}{4}\left(\mu_M - \mu_F\right)^2\,.$$
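This follows from the standard moment formulas for a mixture: for the equal-weight mixture above,

$$\mathbb{E}[h^2] = \frac{1}{2}\left(\mu_M^2 + \sigma_M^2\right) + \frac{1}{2}\left(\mu_F^2 + \sigma_F^2\right)\,,$$

and substituting into $\sigma^2 = \mathbb{E}[h^2] - \mu^2$ with $\mu = \frac{1}{2}(\mu_M + \mu_F)$ gives the expression above.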

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mu_M = 178
mu_F = 162
sigma_M = 7
sigma_F = 6.5

mu = 1/2 * (mu_M + mu_F)
sigma = np.sqrt(1/2 * (sigma_M**2 + sigma_F**2) + 1/4 * (mu_M - mu_F)**2)

x = np.arange(mu - sigma, mu + sigma, 0.001) # heights within one standard deviation of the mean
x_all = np.arange(0, 220, 0.001) # full range of heights to plot
y = 1/2 * (norm.pdf(x, mu_M, sigma_M) + norm.pdf(x, mu_F, sigma_F))
y_M = norm.pdf(x_all, mu_M, sigma_M)
y_F = norm.pdf(x_all, mu_F, sigma_F)
y_T = 1/2 * (y_M + y_F)

plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize=(9,6))
ax.plot(x_all, y_M, label="Male")
ax.plot(x_all, y_F, label="Female")
ax.plot(x_all, y_T, label="General")

ax.fill_between(x, y, 0, alpha=0.3, color='y', label=r"$\mu\pm\sigma$")
ax.fill_between(x_all, y_M, 0, alpha=0.1)
ax.fill_between(x_all, y_F, 0, alpha=0.1)
ax.fill_between(x_all, y_T, 0, alpha=0.1)
ax.set_xlim([100, 220])
ax.set_xlabel('height (cm)')
ax.set_yticklabels([])
ax.set_title('Height Distribution')
plt.legend(loc="upper left")

plt.show()

(Figure: Height Distribution — male, female, and general population densities over height in cm, with the $\mu\pm\sigma$ band highlighted.)

In order to estimate the typical value of the sensitivity, we can compute its expected value $\mathbb{E}[\Delta f]$. This gives us

$$\mathbb{E}[\Delta f] = \frac{1}{2\sqrt{\pi}}\left( \sigma_M + \sigma_F + \sqrt{2(\sigma_M^2 + \sigma_F^2)}\left(e^{-m^2} + \sqrt{\pi}\,m\,\mathrm{erf}(m)\right) \right)$$

with $m = \frac{\mu_M - \mu_F}{\sqrt{2(\sigma_M^2 + \sigma_F^2)}}$. In principle, we could use other statistics, such as $\mathbb{E}[\Delta f] + \sigma$ or some high percentile.

from scipy.special import erf
import math

# Evaluate the closed-form expression for the expected sensitivity.
m = (mu_M - mu_F)/np.sqrt(2*(sigma_M**2 + sigma_F**2))
df = 1/(2*np.sqrt(math.pi)) * (sigma_M + sigma_F + np.sqrt(2*(sigma_M**2 + sigma_F**2))
                               * (np.exp(-m**2) + np.sqrt(math.pi)*m*erf(m)))

print("Expected sensitivity: " + str(round(df,2)) + " cm")
Expected sensitivity: 11.99 cm
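As a sanity check on this closed-form value, we can estimate $\mathbb{E}[\Delta f]$ directly by simulation: for the identity query on a one-record database, the sensitivity between neighboring databases is just $|h_1 - h_2|$ for two independent heights $h_1, h_2$ drawn from the mixture. A minimal check of ours, independent of the framework:

# Draw pairs of heights from the two-component mixture and average |h1 - h2|.
n_samples = 1000000
comp = np.random.randint(0, 2, size=(n_samples, 2))  # 0 = male, 1 = female
heights = np.random.normal(np.where(comp == 0, mu_M, mu_F),
                           np.where(comp == 0, sigma_M, sigma_F))
print("Monte Carlo estimate: " + str(round(np.mean(np.abs(heights[:, 0] - heights[:, 1])), 2)) + " cm")

This should agree with the analytic value above up to Monte Carlo error.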

Notice that, in general, it is not possible to compute the expected value (or any other statistic) of $\Delta f$ exactly. However, these statistics can be estimated using a sampling procedure (see Rubinstein 2017). That is, instead of computing the global sensitivity $\Delta f$ analytically, we compute an empirical estimate of it by sampling over the dataset. The framework provides a method for sampling the sensitivity (see the implementation here).

from shfl.private import IdentityFunction
from shfl.differential_privacy import GaussianMixture
from shfl.differential_privacy import SensitivitySampler
from shfl.differential_privacy import L1SensitivityNorm

norm_params = np.array([[mu_M, sigma_M],
                        [mu_F, sigma_F]])

weights = np.ones(2) / 2.0
distribution = GaussianMixture(norm_params, weights)

sampler = SensitivitySampler()
sensitivity, mean = sampler.sample_sensitivity(IdentityFunction(), L1SensitivityNorm(),
                                               distribution, n=1, gamma=0.05)
print("Sensitivity from sampling: " + str(round(sensitivity,2)))
print("Mean sensitivity from sampling: " + str(round(mean,2)))
Sensitivity from sampling: 51.71
Mean sensitivity from sampling: 12.18

As we can see, the mean sensitivity obtained from sampling is approximately the expected value computed analytically above.

For smaller values of $\epsilon$, the amount of privacy is higher, but the utility is lower. For example, if you wish to compute the mean height over a given population, the accuracy of the result degrades as the amount of privacy increases. This is shown in the experiment below.

In particular, we consider a population of 500 people with a mean height true_mean. However, these people report their height with Laplace noise controlled by $\epsilon$ and the sensitivity $\Delta f$, and the noisy reports are then used to estimate the mean height. As $\epsilon$ decreases, the error in this estimate increases.
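Before running the experiment, we can anticipate the scale of the error. Each Laplace sample has standard deviation $\sqrt{2}\,b = \sqrt{2}\,\Delta f / \epsilon$, so averaging over $N$ independent reports, the noise contributes roughly

$$\sqrt{\frac{2}{N}}\,\frac{\Delta f}{\epsilon}$$

to the error of the estimated mean; with $N = 500$, $\Delta f \approx 12\,\mathrm{cm}$ and $\epsilon = 1$, this is about $0.76\,\mathrm{cm}$.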

from shfl.differential_privacy import LaplaceMechanism
from shfl.private.federated_operation import federate_array

def experiment(sensitivity, epsilon):
    true_heights = np.random.normal(170, 1, 500)
    true_mean = np.mean(true_heights)

    federated_array = federate_array(true_heights, len(true_heights))

    # Define allowed access to data.
    data_access_definition = LaplaceMechanism(sensitivity, epsilon)
    federated_array.configure_data_access(data_access_definition)

    # Query data
    result = federated_array.query()

    observed_mean = np.mean(result)
    error = np.abs(true_mean - observed_mean)

    return observed_mean, error

def run_n_experiments(sensitivity, epsilon, n_runs):
    # Average the estimation error over n_runs independent repetitions.
    total_error = 0
    for _ in range(n_runs):
        _, error = experiment(sensitivity, epsilon)
        total_error += error

    return total_error / n_runs

# Run the experiments:
epsilon_range = np.arange(0.001, 3, 0.05)
df_range = df + np.array([0, sigma, 2*sigma])  # E[Δf], E[Δf] + σ, E[Δf] + 2σ
errors = np.zeros((len(df_range), len(epsilon_range)))

n_runs = 1
for i_df in range(len(df_range)):
    for i_epsilon in range(len(epsilon_range)):
        errors[i_df, i_epsilon] = run_n_experiments(df_range[i_df], epsilon_range[i_epsilon], n_runs)

# Plot the figure:
fig, ax = plt.subplots(figsize=(9,6))
labels = [r"$\mathbb{E}[\Delta f]$", r"$\mathbb{E}[\Delta f] + \sigma$", r"$\mathbb{E}[\Delta f] + 2\sigma$"]
for i_df in range(len(df_range)):
    ax.plot(epsilon_range, errors[i_df], label=labels[i_df])
ax.set_xlabel(r'$\epsilon$')
ax.set_ylabel('Error (cm)')
ax.set_yscale('log')
plt.legend(title="Sensitivity", loc="upper right")
caption = ("The error grows as privacy increases (i.e. for smaller values of $\\epsilon$).\n"
           "Moreover, for increasing values of the expected sensitivity $\\mathbb{E}[\\Delta f]$,\n"
           "the privacy increases and thus the error grows.")
ax.text(0.1, 0.005, caption, ha='left')
plt.show()

(Figure: error of the private mean estimate as a function of $\epsilon$, on a log scale, for the three sensitivity choices.)