Artificial Intelligence for Privacy


Exponential Mechanism

In this notebook, we introduce the exponential mechanism (see The Algorithmic Foundations of Differential Privacy, Section 3.4). We consider three different examples that show how general the mechanism is.

Let $\mathcal{X}$ be the set of possible rows in a database, so that we can represent a particular database using a histogram $x\in\mathbb{N}^{|\mathcal{X}|}$. Given an arbitrary response space $\mathcal{R}$ and a utility function

$$u: \mathbb{N}^{|\mathcal{X}|}\times \mathcal{R} \rightarrow \mathbb{R}\,,$$

we define the exponential mechanism as the collection of conditional probabilities given by

$$P(r|x; u, \alpha) = \frac{e^{\alpha u(x,r)}}{\sum_{r'\in\mathcal{R}} e^{\alpha u(x,r')}}\,.$$

In the case of uncountable $\mathcal{R}$, the sum in the denominator should be replaced with an integral, and the expression above gives a probability density.

The exponential mechanism is an $\epsilon$-differentially private randomized algorithm with $\epsilon = 2\alpha\Delta u$ and

$$\Delta u = \max_{r\in\mathcal{R}}\ \max_{||x-y||_1 \le 1} |u(x,r) - u(y,r)|\,.$$
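
To make the definition concrete, here is a minimal NumPy sketch (not using the library introduced later in this notebook) that samples from the exponential mechanism over a finite response set, using the $\alpha$-parametrization above; the counting-query utility at the end is a hypothetical toy example.

import numpy as np

def exponential_mechanism_sample(x, responses, utility, alpha):
    """Sample one response r with probability proportional to exp(alpha * u(x, r))."""
    scores = np.array([utility(x, r) for r in responses], dtype=float)
    # Shift by the maximum score for numerical stability; the shift cancels in the normalization
    weights = np.exp(alpha * (scores - scores.max()))
    return np.random.choice(responses, p=weights / weights.sum())

# Toy example: dataset of bits and the counting query f(x) = sum(x), with u(x, r) = -|f(x) - r|
x_toy = [0, 1, 1, 0, 1]
possible_outputs = np.arange(0, 6)
print(exponential_mechanism_sample(x_toy, possible_outputs, lambda x, r: -abs(sum(x) - r), alpha=1.0))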

The utility function $u$ can be thought of as a generalization of a numeric query. In fact, the mechanism reduces to simpler mechanisms for particular choices of $u$. To illustrate this, we show how it works in two well-known examples.

Randomized Response (see also the notebook Differential Privacy Basic Concepts)

Consider the case in which $\mathcal{R} = \{0, 1\}$ and $\mathcal{X} = \{\text{truly innocent}, \text{truly guilty}\}$ with a utility function such that

$$u(0, 0) = 0 \qquad u(1, 0) = 0$$
$$u(0, 1) = \beta \qquad u(1, 1) = \gamma$$

where $\beta, \gamma$ are real constants. In this case, the exponential mechanism reduces to the simplest randomized response algorithm.
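
As a small illustration (a sketch with arbitrary, hypothetical values for $\alpha$, $\beta$ and $\gamma$), the conditional probabilities $P(r|x)$ for this utility can be tabulated directly, which makes the reduction to a two-outcome randomized response explicit.

import numpy as np

alpha, beta, gamma = 1.0, -0.5, 0.5   # Hypothetical constants for the illustration

# u(x, r) as defined above
u_table = {0: {0: 0.0, 1: beta},
           1: {0: 0.0, 1: gamma}}

for x_value in (0, 1):
    weights = np.exp(alpha * np.array([u_table[x_value][0], u_table[x_value][1]]))
    probabilities = weights / weights.sum()
    print("P(r=0 | x=%d) = %.3f, P(r=1 | x=%d) = %.3f"
          % (x_value, probabilities[0], x_value, probabilities[1]))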

Laplace Mechanism (see also the notebook Laplace Mechanism)

Consider the case in which $\mathcal{R} = \mathbb{R}$ and the utility function is given by $u(x, r) = -|f(x) - r|$ with

$$f: \mathbb{N}^{|\mathcal{X}|} \rightarrow \mathbb{R}\,.$$

This is, by definition, the Laplace mechanism with $b = \alpha^{-1}$.
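
A quick numerical check of this reduction (a sketch; `scipy.stats.laplace` and the values of $\alpha$ and $f(x)$ are chosen here only for the comparison): normalizing $e^{\alpha u(x,r)} = e^{-\alpha|f(x)-r|}$ over a fine grid recovers the Laplace density with scale $b = \alpha^{-1}$.

import numpy as np
from scipy.stats import laplace

alpha = 1.5      # Hypothetical value for the check
f_x = 2.0        # Hypothetical value of f(x)
step = 0.001
r_grid = np.arange(-20, 20, step)

unnormalized = np.exp(-alpha * np.abs(f_x - r_grid))
density = unnormalized / (unnormalized.sum() * step)   # Numerical normalization over the grid

# Maximum deviation from the Laplace density with scale b = 1/alpha; should be close to zero
print(np.max(np.abs(density - laplace.pdf(r_grid, loc=f_x, scale=1/alpha))))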

Example: Pricing

The input dataset $x$ is a set of bids for the purchase of an abundant supply of a product. The problem is to identify the best price $r \in \mathcal{R} = [r_{min}, r_{max}]$ so as to maximize the revenue, without revealing the bids. In this case the revenue is our utility function $u$, defined as

$$u(x, r) = r\,|S_r|$$

with

$$S_r = \{i: x_i\ge r\}\,,$$

the set of people that are willing to buy at a price $r$.

In general, it is not possible to compute $\Delta u$ analytically, and one must resort to statistical estimation of "typical" values of $\Delta u$ (see Rubinstein 2017).

This is also the case when $\Delta u$ is not bounded. In our particular case, however, we can compute the sensitivity of the utility exactly, and it is given by

$$\Delta u = r_{max}\,.$$
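
As a quick empirical check of this bound (a sketch with hypothetical bids, separate from the example below): adding or removing a single bid changes $|S_r|$ by at most one, so the revenue changes by at most $r \le r_{max}$.

import numpy as np

def revenue(bids, price):
    # Revenue u(x, r) = r * |S_r|, with S_r the set of bids greater than or equal to the price
    return price * np.sum(np.asarray(bids) >= price)

r_max = 3.5
prices = np.arange(0, r_max, 0.001)
bids = [1.00, 2.50, 3.01]            # Hypothetical bids
neighbour = bids + [3.40]            # Neighbouring dataset: one additional bid

max_difference = max(abs(revenue(neighbour, p) - revenue(bids, p)) for p in prices)
print(max_difference, max_difference <= r_max)   # The difference never exceeds r_max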

The complication in this case arises from the fact that the output price may not be perturbed directly. In Example 3.5 of The Algorithmic Foundations of Differential Privacy, there are four bidders, A, F, I, and K, where A, F, and I each bid 1.00 and K bids 3.01, so $x = \{1, 1, 1, 3.01\}$.
If we fix the price at $3.01$, the revenue is $u=3.01$; at $3.00$ we get $u=3.00$; at $1.00$ we get $u=4.00$. However, if we fix the price at $3.02$, the revenue is zero! The revenue plot for this specific demand defined by $x$ is shown below.

import matplotlib.pyplot as plt
import numpy as np

def u(x, r):
    '''
    Utility function (revenue).

    Arguments:
        x: list. True bids.
        r: array of possible values for the price.
    '''
    output = np.zeros(len(r))
    for i in range(len(r)):
        # Revenue at price r[i]: price times the number of bids greater than or equal to it
        output[i] = r[i] * sum(np.greater_equal(x, r[i]))
    return output

x = [1.00, 1.00, 1.00, 3.01] # Input dataset: the true bids
r = np.arange(0, 3.5, 0.001) # Set the interval of possible outputs r

# Plot the utility (revenue) for each possible output r (price)
utility = u(x, r)
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize=(9, 6))
x_intervals = np.sort(np.unique(np.append(x, [r.min(), r.max()])))
for i_interval in range(len(x_intervals) - 1):
    indices_interval = [all(union) for union in zip(
        r > x_intervals[i_interval], r <= x_intervals[i_interval + 1])]
    ax.plot(r[indices_interval], utility[indices_interval], color = "blue")

ax.set_xlim([r.min(), r.max()])
ax.set_xlabel('$r$')
ax.set_ylabel('Utility')
label="Utility (i.e. revenue) for all possible prices $r$"
ax.text(r.max()/2, -1, label, ha='center')

plt.show()

(Figure: utility, i.e. revenue, for all possible prices $r$.)

As explained above, the exponential mechanism is defined with respect to the utility function $u : \mathbb{N}^{|\mathcal{X}|} \times \mathcal{R} \rightarrow \mathbb{R}$, which maps database/output pairs to utility scores. For a fixed database $x$, the exponential mechanism outputs each possible $r \in \mathcal{R}$ with probability proportional to $\exp\left(\frac{\epsilon u(x, r)}{2\Delta u}\right)$. The resulting probability distribution is shown below.

It can be observed that for low values of $\epsilon$ the density approaches the flat curve of the uniform distribution, and thus privacy increases. Conversely, for higher values of $\epsilon$ the density exponentially amplifies the jumps in the revenue, implying less privacy.

def PDF(x, r, epsilon):
    """
    Probability density of the exponential mechanism output for the database x.
    """
    r_min = r.min()
    r_max = r.max()
    x_intervals = np.sort(np.unique(np.append(x, [r_min, r_max])))
    # The utility is piecewise linear in r, so the normalization constant can be
    # computed analytically interval by interval.
    area = 0
    for i in range(len(x_intervals) - 1):
        S = epsilon/(2*r_max) * sum(np.greater(x, x_intervals[i]))
        if S > 0:
            area_int = 1/S * (np.exp(S * x_intervals[i + 1]) - np.exp(S * x_intervals[i]))
        else:
            area_int = x_intervals[i + 1] - x_intervals[i]
        area = area + area_int

    u_prob_norm = np.exp(epsilon * u(x, r) / (2 * r_max)) / area
    return u_prob_norm


epsilon_range = [0.01, 1, 5]

fig, ax = plt.subplots(figsize=(9,6))
color_list = ["b", "g", "r", "c", "m"]

x_intervals = np.sort(np.unique(np.append(x, [r.min(), r.max()])))
for i_epsilon in range(len(epsilon_range)):
    u_prob_norm = PDF(x, r, epsilon_range[i_epsilon])
    for i_interval in range(len(x_intervals) - 1):
        indices_interval = [all(union) for union in zip(
            r > x_intervals[i_interval], r <= x_intervals[i_interval + 1])]
        ax.plot(r[indices_interval],
                u_prob_norm[indices_interval],
                color = color_list[i_epsilon],
                label = '$\epsilon = $' + str(epsilon_range[i_epsilon]))

handles, labels = ax.get_legend_handles_labels()  
handles = handles[0:len(handles):(len(x)-1)]
labels = labels[0:len(labels):(len(x)-1)]
ax.set_xlim([r.min(), r.max()])
ax.set_ylim([-0.1, max(u_prob_norm) + 0.1])
ax.set_xlabel('$r$')
ax.set_ylabel('Probability density')
ax.legend(handles, labels)
label="Probability density function for different levels of privacy $\epsilon$. For higher levels of privacy (i.e. lower $\epsilon$), \n\
the probability resembles the uniform distribution, while for lower levels of privacy (i.e. higher $\epsilon$), \n\
the probability reveals the jumps in revenue and thus the bids."
ax.text(r.max()/2, -0.5, label, ha='center')

plt.show()

(Figure: probability density function of the output price for different privacy levels $\epsilon$.)

We create a node that contains the true bids and choose to access the data using the exponential mechanism. We repeat the experiment 10,000 times to check that the resulting distribution for the output price matches the one shown earlier.

from shfl.differential_privacy.dp_mechanism import ExponentialMechanism
from shfl.private.node import DataNode

node = DataNode()    # Create a node
node.set_private_data(name="bids", data=np.array(x)) # Store the database x in the node
delta_u = r.max()    # In this specific case, Delta u = max(r)
epsilon = 5          # Set a value for epsilon
size = 10000         # We want to repeat the query this many times

data_access_definition = ExponentialMechanism(u, r, delta_u, epsilon, size)
node.configure_data_access("bids", data_access_definition)
result = node.query("bids")
fig, ax = plt.subplots(figsize=(9,6))
plt.hist(result, bins = int(round(np.sqrt(len(result)))), density = True)
ax.set_xlabel('$r$')
ax.set_ylabel('Density')
label="Normalized histogram showing 10,000 samplings over the probability defined by the privacy level $\epsilon=5$. \n\
The sampled prices recover the actual distribution of the prices (see curve $\epsilon=5$ in the previous picture)."
ax.text(r.max()/2, -0.25, label, ha='center')

plt.show()

(Figure: normalized histogram of 10,000 sampled prices for $\epsilon = 5$.)

The trade-off between accuracy and privacy can be assessed by computing the mean revenue loss, which is defined as

$${\rm loss} = |u_{\rm OPT}(x) - u(x, r_{\rm sample})|$$

where $u_{\rm OPT}(x)$ is the highest possible revenue for a fixed database $x$, and $u(x, r_{\rm sample})$ is the revenue at the sampled price $r_{\rm sample}$.

This is a measure of the loss in utility that we get when using the exponential mechanism compared to the case without any privacy at all. The plot below shows that, as $\epsilon$ increases (i.e. privacy decreases), the revenue loss goes towards zero.

epsilon_range = np.arange(0.001, 100, 10)
optimum_price = r[np.argmax(utility)]     # Price that maximizes the revenue
optimum_revenue = u(x, [optimum_price])   # Highest possible revenue u_OPT(x)
mean_loss = np.zeros(len(epsilon_range))

for i in range(len(epsilon_range)):
    epsilon = epsilon_range[i]
    node = DataNode()
    node.set_private_data(name="bids", data=np.array(x))
    data_access_definition = ExponentialMechanism(u, r, delta_u, epsilon, size)
    node.configure_data_access("bids", data_access_definition)
    result = node.query("bids")
    mean_loss[i] = np.mean(abs(u(x, result) - optimum_revenue))

fig, ax = plt.subplots(figsize=(9,6))
ax.plot(epsilon_range, mean_loss)
ax.set_xlim([0, max(epsilon_range)])
ax.set_ylim([-0.1, max(mean_loss)])
ax.set_xlabel('$\epsilon$')
ax.set_ylabel('Mean revenue loss')
label="Mean revenue loss for different levels of privacy $\epsilon$. \
As privacy decreases (i.e. higher $\epsilon$), the revenue loss goes to zero."
ax.text(epsilon_range.max()/2, -0.5, label, ha='center')

plt.show()

(Figure: mean revenue loss as a function of the privacy level $\epsilon$.)

Example: Randomized Response from the Exponential Mechanism (see also Differential privacy basic concepts)

Consider the case in which $\mathcal{R} = \{0, 1\}$ and $\mathcal{X} = \{\text{truly innocent}, \text{truly guilty}\}$ with a utility given by

$$u(0, 0) = 0 \qquad u(1, 0) = 0$$
$$u(0, 1) = \beta \qquad u(1, 1) = -\beta$$

This means that $\Delta u = 2|\beta|$, so $\epsilon = 4\alpha|\beta|$. For concreteness, in the following we choose $\beta = -1/4$ so that $\epsilon = \alpha$.
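
Spelling out the sensitivity computation (the two neighbouring databases here differ only in the value of $x$):

$$\Delta u = \max_{r\in\{0,1\}}\ \max_{x \neq y} |u(x,r) - u(y,r)| = \max\{|0 - 0|,\ |\beta - (-\beta)|\} = 2|\beta|\,,$$

so that $\epsilon = 2\alpha\Delta u = 4\alpha|\beta|$, and with $\beta = -1/4$ this gives $\epsilon = \alpha$.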

In the figure below we show the percentage of times the query returns the true value, as a function of $\epsilon$. We observe that this percentage approaches one as the privacy decreases. For $\epsilon=0$,
which provides maximal privacy, the result of the query is completely uninformative, since the probability of getting either result is independent of the true value.

def u_randomized(x, r):
    '''
    Utility function for the randomized response mechanism (beta = -1/4).

    Arguments:
        x: int. True value (0 or 1).
        r: array of possible outputs (0 and 1).
    '''
    output = -1/4 * (1 - 2*x) * r
    return output

r = np.array([0,1]) # Set the interval of possible outputs r
x = 1               # Set a value for the dataset
delta_u = 1         # We simply set it to one
size = 100000       # We want to repeat the query this many times

epsilon_range = np.arange(0.001, 100, 10)
positives = np.zeros(len(epsilon_range))

for i in range(len(epsilon_range)):
    epsilon = epsilon_range[i]
    node = DataNode()               # Create a node
    node.set_private_data(name="identity", data=np.array(x))  # Store the database x in the node
    data_access_definition = ExponentialMechanism(u_randomized, r, delta_u, epsilon, size)
    node.configure_data_access("identity", data_access_definition)
    result = node.query("identity")
    positives[i] = result.sum() / size

fig, ax = plt.subplots(figsize=(9,6))
ax.plot(epsilon_range, positives)
ax.set_xlim([0, max(epsilon_range)])
ax.set_xlabel('$\epsilon$')
ax.set_ylabel('Fraction of positives')
label="Randomized response: percentage of time the query returns the true value, as a function of the privacy level $\epsilon$.\n\
For a lower degree of privacy (i.e. higher $\epsilon$), the query returns the true value almost 100% of the times. On the contrary, \n\
for a higher degree of privacy (i.e. lower $\epsilon$), the query returns either true or false 50% of the time, thus being completely uninformative."
ax.text(epsilon_range.max()/2, 0.3, label, ha='center')

plt.show()

(Figure: fraction of queries returning the true value as a function of the privacy level $\epsilon$.)
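
As a cross-check (a sketch, assuming the sampling probability proportional to $\exp\left(\frac{\epsilon u(x,r)}{2\Delta u}\right)$ stated earlier, and using the values $u(1,1)=1/4$, $u(1,0)=0$ and $\Delta u = 1$ from the code above), the expected fraction of positives has the closed form $1/(1 + e^{-\epsilon/8})$.

import numpy as np

epsilon_range = np.arange(0.001, 100, 10)
delta_u = 1
u_true, u_false = 0.25, 0.0    # u(1, 1) and u(1, 0) from u_randomized with x = 1

# Probability of returning the true value under P(r|x) proportional to exp(epsilon * u / (2 * delta_u))
expected_positives = 1 / (1 + np.exp(-epsilon_range * (u_true - u_false) / (2 * delta_u)))
print(np.round(expected_positives, 3))   # Should track the empirical curve in the figure above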

Example: Laplace Mechanism from the Exponential Mechanism (see also the notebook Laplace Mechanism)

We choose the utility function to be $u(x, r) = -\left| f(x) - r \right|$ with $r\in\mathbb{R}$. By substituting it into the exponential mechanism, we obtain an algorithm with conditional probability

$$P(r|x; f, \alpha) = \frac{\alpha}{2}e^{-\alpha |f(x) - r|}\,.$$

In this example we pick $f(x)$ to be the identity, so we only need to define the utility function. Below, we show a (normalized) histogram of the results of repeatedly performing the query. As you can see, it recovers the Laplace distribution centered at the true value $f(x)$.

def u_laplacian(x, r):
    '''
    Utility function for Laplacian mechanism

    Arguments:
        x: float.
        r: array of reals.
    '''
    output = -np.absolute(x - r)
    return output

# Define some example values:
r = np.arange(-20, 20, 0.001) # Set the interval of possible outputs r
x = 3.5                       # Set a value for the dataset
delta_u = 1                   # We simply set it to one
epsilon = 1                   # Set a value for epsilon
size = 100000                 # We want to repeat the query this many times

node = DataNode()             # Create a node
node.set_private_data(name="identity", data=np.array(x)) # Store the database x in the node

data_access_definition = ExponentialMechanism(u_laplacian, r, delta_u, epsilon, size)
node.configure_data_access("identity", data_access_definition)
result = node.query("identity")

fig, ax = plt.subplots(figsize=(9,6))
plt.hist(result, bins = int(round(np.sqrt(len(result)))), density = True, label="Output from query")
plt.axvline(x=x, linewidth=2, color='r', label= "True value")
ax.set_xlabel('$r$')
ax.set_ylabel('Density')
ax.legend(facecolor='white', framealpha=1, frameon=True)
label="Laplace mechanism: normalized histogram showing 100,000 identity queries, for $\Delta u = 1$ and $\epsilon = 1$. \n\
Due to the noise added by the Laplace mechanism, the values returned by the query recover\n\
the Laplace distribution centered at the true value."
ax.text((r.max()+r.min())/2, -0.07, label, ha='center')

plt.show()

(Figure: normalized histogram of the identity-query outputs, centered at the true value $f(x)$.)
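
The agreement can also be checked against the analytic density (a sketch reusing `result`, `x`, `delta_u` and `epsilon` from the cell above; substituting $u(x, r) = -|f(x) - r|$ into $\exp\left(\frac{\epsilon u(x,r)}{2\Delta u}\right)$ gives a Laplace density with scale $b = 2\Delta u/\epsilon$, i.e. $b = 2$ for the values used here).

from scipy.stats import laplace

b = 2 * delta_u / epsilon                      # Scale of the implied Laplace distribution
r_grid = np.arange(-20, 20, 0.001)

fig, ax = plt.subplots(figsize=(9, 6))
plt.hist(result, bins=int(round(np.sqrt(len(result)))), density=True, label="Output from query")
plt.plot(r_grid, laplace.pdf(r_grid, loc=x, scale=b), linewidth=2, color='r', label="Analytic Laplace density")
ax.set_xlabel('$r$')
ax.set_ylabel('Density')
ax.legend(facecolor='white', framealpha=1, frameon=True)

plt.show()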