Privacy-Preserving Artificial Intelligence


Unsupervised Federated Learning: K-means Clustering Model

This notebook covers the problem of unsupervised learning in a federated configuration. In particular, we use the K-means clustering algorithm from the sklearn library (see this link).

The framework provides some functions to load the Iris dataset.

import matplotlib.pyplot as plt
import shfl
import numpy as np
from shfl.data_base.iris import Iris


# Assign database:
database = Iris()

train_data, train_labels, test_data, test_labels = database.load_data()
print(train_data.shape)
print(test_data.shape)

# Visualize training data:
fig, ax = plt.subplots(1,2, figsize=(16,8))
fig.suptitle("Iris database", fontsize=20)
ax[0].set_title('True labels', fontsize=18)
ax[0].scatter(train_data[:, 0], train_data[:, 1], c=train_labels, s=150, edgecolor='k', cmap="plasma")
ax[0].set_xlabel('Sepal length', fontsize=18)
ax[0].set_ylabel('Sepal width', fontsize=18)
ax[0].tick_params(direction='in', length=10, width=5, colors='k', labelsize=20)

ax[1].set_title('True labels', fontsize=18)
ax[1].scatter(train_data[:, 2], train_data[:, 3], c=train_labels, s=150, edgecolor='k', cmap="plasma")
ax[1].set_xlabel('Petal length', fontsize=18)
ax[1].set_ylabel('Petal width', fontsize=18)
ax[1].tick_params(direction='in', length=10, width=5, colors='k', labelsize=20)

plt.show()
(135, 4)
(15, 4)

(Figure: "Iris database" scatter plots of true labels, by sepal and petal dimensions)

We implement a method to plot K-means results on the Iris dataset and establish a centralized model, which will serve as our reference model.

from shfl.model.kmeans_model import KMeansModel

def plot_k_means(km, X, title):
    new_labels = km.predict(X)
    fig, axes = plt.subplots(1, 2, figsize=(16,8))
    fig.suptitle(title, fontsize=20)
    axes[0].scatter(X[:, 0], X[:, 1], c=new_labels, cmap='plasma', edgecolor='k', s=150)
    axes[0].set_xlabel('Sepal length', fontsize=18)
    axes[0].set_ylabel('Sepal width', fontsize=18)
    axes[0].tick_params(direction='in', length=10, width=5, colors='k', labelsize=20)
    axes[0].set_title('Predicted', fontsize=18)

    axes[1].scatter(X[:, 2], X[:, 3], c=new_labels, cmap='plasma', edgecolor='k', s=150)
    axes[1].set_xlabel('Petal length', fontsize=18)
    axes[1].set_ylabel('Petal width', fontsize=18)
    axes[1].tick_params(direction='in', length=10, width=5, colors='k', labelsize=20)
    axes[1].set_title('Predicted', fontsize=18)

# Train a centralized model as a reference:
centralized_model = KMeansModel(n_clusters=3, n_features=train_data.shape[1], init=np.zeros((3, 4)))
centralized_model.train(train_data)

print(centralized_model.get_model_params())
plot_k_means(centralized_model, train_data, title = "Benchmark: K-means using centralized data")
[[4.99285714 3.40714286 1.45       0.24047619]
 [6.85833333 3.06666667 5.75       2.07777778]
 [5.89824561 2.74210526 4.4        1.43859649]]

(Figure: "Benchmark: K-means using centralized data")

How to aggregate model parameters from federated nodes in clustering

Since the cluster labels can vary across nodes, we cannot average the centroids right away. One solution is to choose the lowest-distance average: this is achieved by simply applying the K-means algorithm to the centroid coordinates gathered from all nodes. You can see this implementation in ClusterFedAvgAggregator.

Note: This implementation assumes that the number of clusters has been fixed in advance across the clients, so it only works properly in IID scenarios (see Federated Sampling). We are working on a federated aggregation operator that works with any distribution of data across clients.
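The aggregation idea can be sketched in a few lines with plain sklearn (a minimal illustration under the same fixed-number-of-clusters assumption; `aggregate_centroids` is a hypothetical helper, not the framework's actual implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def aggregate_centroids(clients_centroids, n_clusters):
    """Cluster the stacked client centroids to obtain global centroids.

    clients_centroids: list of arrays, each of shape (n_clusters, n_features).
    """
    stacked = np.concatenate(clients_centroids, axis=0)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(stacked)
    return km.cluster_centers_

# Toy example: two clients found the same two 1-D clusters,
# but with swapped cluster labels:
c1 = np.array([[0.0], [10.0]])
c2 = np.array([[10.1], [0.1]])
global_centroids = aggregate_centroids([c1, c2], n_clusters=2)
print(np.sort(global_centroids, axis=0))  # ~[[0.05], [10.05]]
```

Clustering the centroids makes the aggregation invariant to each node's arbitrary cluster labelling, which a naive coordinate-wise average is not.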

from shfl.federated_aggregator.cluster_fedavg_aggregator import ClusterFedAvgAggregator

# Create the IID data:
iid_distribution = shfl.data_distribution.IidDataDistribution(database)
federated_data, test_data, test_label = iid_distribution.get_federated_data(num_nodes = 12, percent=100)
print("Number of nodes: " + str(federated_data.num_nodes()))

# Run the algorithm:
aggregator = ClusterFedAvgAggregator()
Number of nodes: 12

We are now ready to run our model in a federated configuration.

The performance is assessed by several clustering metrics (see this link).

For reference, below, we compare the metrics of:

  • Each node
  • The global (federated) model
  • The centralized (non-federated) model

It can be observed that the performance of the global federated model is, in general, superior to that of each individual node, so the federated learning approach proves beneficial. Moreover, the performance of the global federated model is very close to that of the centralized model.
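For clustering, such comparisons rely on label-invariant metrics. As a hedged illustration, the standard ones can be computed directly with sklearn.metrics (these are common choices such as homogeneity, completeness, V-measure and adjusted Rand index, not necessarily the exact tuple the framework prints below):

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Label-invariant clustering metrics: they score the partition itself,
# regardless of which integer each cluster happens to receive.
print(metrics.homogeneity_score(y, pred))
print(metrics.completeness_score(y, pred))
print(metrics.v_measure_score(y, pred))
print(metrics.adjusted_rand_score(y, pred))
```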

from shfl.federated_government.federated_government import FederatedGovernment

n_clusters = 3 # Set number of clusters
n_features = train_data.shape[1]
def model_builder():
    model = KMeansModel(n_clusters=n_clusters, n_features = n_features)
    return model


federated_government = FederatedGovernment(model_builder, federated_data, aggregator)
print("Test data size: " + str(test_data.shape[0]))
print("\n")
federated_government.run_rounds(n = 3, test_data = test_data, test_label = test_label)

# Reference centralized (non-federated) model:
print("Centralized model test performance : " + str(centralized_model.evaluate(data=test_data, labels=test_labels)))
plot_k_means(centralized_model, test_data, title = "Benchmark on Test data: K-means using CENTRALIZED data")
plot_k_means(federated_government.global_model, test_data, title = "Benchmark on Test data: K-means using FL")
Test data size: 15


Accuracy round 0
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x14188e690>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x141b648d0>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bb350>: (0.8514718749116273, 0.8514718749116273, 0.8514718749116273, 0.8748012718600954)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bb3d0>: (0.8347875051269761, 0.8687756288885651, 0.8514425151306293, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bb590>: (1.0, 1.0, 1.0, 1.0)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bf510>: (0.8347875051269761, 0.8687756288885651, 0.8514425151306293, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bfb10>: (0.8514718749116273, 0.8514718749116273, 0.8514718749116273, 0.8748012718600954)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bff50>: (1.0, 1.0, 1.0, 1.0)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bfd10>: (0.7253812460663952, 0.830041729736677, 0.7741903180935481, 0.7585281717133001)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bfcd0>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x141b8bf90>: (0.7478187423095356, 0.8557165845714431, 0.7981375767908175, 0.7987734764277501)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x141b8b250>: (0.7777989646831382, 0.8094668170538627, 0.7933169850872153, 0.7938751472320377)
Global model test performance : (1.0, 1.0, 1.0, 1.0)



Accuracy round 1
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x14188e690>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x141b648d0>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bb350>: (0.8514718749116273, 0.8514718749116273, 0.8514718749116273, 0.8748012718600954)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bb3d0>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bb590>: (1.0, 1.0, 1.0, 1.0)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bf510>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bfb10>: (0.8514718749116273, 0.8514718749116273, 0.8514718749116273, 0.8748012718600954)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bff50>: (1.0, 1.0, 1.0, 1.0)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bfd10>: (0.8514718749116273, 0.8514718749116273, 0.8514718749116273, 0.8748012718600954)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bfcd0>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x141b8bf90>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x141b8b250>: (0.7777989646831382, 0.8094668170538627, 0.7933169850872153, 0.7938751472320377)
Global model test performance : (1.0, 1.0, 1.0, 1.0)



Accuracy round 2
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x14188e690>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x141b648d0>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bb350>: (0.8514718749116273, 0.8514718749116273, 0.8514718749116273, 0.8748012718600954)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bb3d0>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bb590>: (1.0, 1.0, 1.0, 1.0)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bf510>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bfb10>: (0.8514718749116273, 0.8514718749116273, 0.8514718749116273, 0.8748012718600954)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bff50>: (1.0, 1.0, 1.0, 1.0)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bfd10>: (0.8514718749116273, 0.8514718749116273, 0.8514718749116273, 0.8748012718600954)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bfcd0>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x141b8bf90>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x141b8b250>: (0.7777989646831382, 0.8094668170538627, 0.7933169850872153, 0.7938751472320377)

Global model test performance : (1.0, 1.0, 1.0, 1.0)

Centralized model test performance : (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)

(Figure: "Benchmark on Test data: K-means using CENTRALIZED data")

(Figure: "Benchmark on Test data: K-means using FL")

Differentially Private Version

To preserve client privacy, in this section we introduce Differential Privacy (DP) into our model. First, we calibrate the noise introduced by the differentially private mechanism using the training data; then we apply DP to each feature of the clients' models, so that the clusters computed by each client are shared with the central server privately, i.e., without revealing precise information about the client's data.

When applying the Gaussian privacy mechanism, the noise added has to be of the same order as the sensitivity of the model's output, i.e., the coordinates of each cluster centroid.

In the general case, the model's sensitivity might be difficult to compute analytically. An alternative approach is to attain random differential privacy through sampling over the data.

That is, instead of computing the global sensitivity $\Delta f$ analytically, we compute an empirical estimate of it by sampling over the dataset. This approach is very convenient, since it allows estimating the sensitivity of an arbitrary model or black-box computer function. The Sherpa.ai Federated Learning and Differential Privacy Framework provides this functionality in the class SensitivitySampler.

In order to carry out this approach, we need to specify a distribution of the data to sample from. Generally, this requires prior knowledge and/or model assumptions. However, in this specific case, we can assume that the data distribution is uniform. We define a class inheriting from ProbabilityDistribution that samples uniformly over a data frame. Moreover, we assume that we have access to a set of data (this can be thought of as, for example, a public dataset). In this example, we use the training partition for sampling:

import numpy as np

class UniformDistribution(shfl.differential_privacy.ProbabilityDistribution):
    """
    Implement Uniform sampling over the data
    """
    def __init__(self, sample_data):
        self._sample_data = sample_data

    def sample(self, sample_size):
        row_indices = np.random.randint(low=0, high=self._sample_data.shape[0], size=sample_size, dtype='l')

        return self._sample_data[row_indices, :]

sample_data = train_data

The class SensitivitySampler implements the sampling, given a query (in this case, the learning model itself). We only need to add the get method to our model, since it is required by SensitivitySampler. We choose the sensitivity norm to be the $L_2$ norm and apply the sampling. The accuracy of the estimate depends on the number of samples n: the more samples we take, the more accurate the estimated sensitivity becomes, and it typically decreases as n grows.

Unfortunately, sampling over a dataset involves training the model on two datasets differing in one entry at each sample. Thus, in general, this procedure might be computationally expensive (e.g., when training a deep neural network).
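The idea behind the sampler can be sketched with a much cheaper query than K-means, say, the column-wise mean (a minimal illustration; `empirical_l2_sensitivity` is a hypothetical helper, not the framework's API):

```python
import numpy as np

def empirical_l2_sensitivity(query, data, n_samples, m, rng):
    """Estimate the L2 sensitivity of `query` empirically: draw pairs of
    datasets of size m differing in a single entry, and take the largest
    observed change in the query's output."""
    diffs = []
    for _ in range(n_samples):
        idx = rng.integers(0, data.shape[0], size=m)
        d1 = data[idx]
        d2 = d1.copy()
        d2[0] = data[rng.integers(0, data.shape[0])]  # neighbouring dataset
        diffs.append(np.linalg.norm(query(d1) - query(d2)))
    return max(diffs)

rng = np.random.default_rng(42)
public_data = rng.uniform(0.0, 1.0, size=(1000, 4))
s = empirical_l2_sensitivity(lambda d: d.mean(axis=0), public_data,
                             n_samples=100, m=64, rng=rng)
print(s)  # at most 2/64 ~ 0.031 for data in [0, 1]^4
```

For the mean query the estimate can be checked against the analytic bound (changing one of m rows moves each coordinate by at most 1/m); for K-means, as in the notebook, no such closed form is available, which is exactly why sampling is used.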

from shfl.differential_privacy import SensitivitySampler
from shfl.differential_privacy import L2SensitivityNorm

class KMeansSample(KMeansModel):

    def __init__(self, feature, **kwargs):
        self._feature = feature
        super().__init__(**kwargs)

    def get(self, data_array):
        self.train(data_array)
        params = self.get_model_params()
        return params[:, self._feature]

distribution = UniformDistribution(sample_data)
sampler = SensitivitySampler()
# Reproducibility
np.random.seed(789)
n_samples = 50

sensitivities = np.empty(n_features)

for i in range(n_features):
    model = KMeansSample(feature=i, n_clusters=n_clusters, n_features=n_features)
    sensitivities[i], _ = sampler.sample_sensitivity(model, L2SensitivityNorm(), distribution, n=n_samples, gamma=0.05)
print("Max sensitivity from sampling: ", np.max(sensitivities))
print("Min sensitivity from sampling: ", np.min(sensitivities))
print("Mean sensitivity from sampling:", np.mean(sensitivities))
Max sensitivity from sampling:  6.792526673302205
Min sensitivity from sampling:  1.5088193135977614
Mean sensitivity from sampling: 3.706067579248054

Generally, if the model has more than one feature, it is a bad idea to estimate the sensitivity of all the features at once, since they may have wildly varying sensitivities. Instead, we estimate the sensitivity of each feature separately. Note that we provide the array of estimated sensitivities to the GaussianMechanism, which applies the corresponding noise to each feature individually.
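The per-feature noise calibration can be sketched as follows (a minimal illustration of the classic Gaussian mechanism, valid for $\epsilon < 1$; `gaussian_mechanism` and the example values are hypothetical, not the framework's GaussianMechanism):

```python
import numpy as np

def gaussian_mechanism(values, sensitivities, epsilon, delta, rng):
    """Add per-feature Gaussian noise, each feature calibrated to its own
    sensitivity: sigma = sqrt(2 * ln(1.25/delta)) * sensitivity / epsilon."""
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * np.asarray(sensitivities) / epsilon
    # sigma has shape (n_features,) and broadcasts over the centroid rows:
    return values + rng.normal(0.0, sigma, size=values.shape)

rng = np.random.default_rng(0)
centroids = np.arange(12, dtype=float).reshape(3, 4)  # hypothetical cluster centers
sens = np.array([1.5, 2.0, 3.5, 6.8])                 # one sensitivity per feature
noisy = gaussian_mechanism(centroids, sens, epsilon=0.9, delta=0.2, rng=rng)
print(noisy.shape)  # (3, 4)
```

Features with larger estimated sensitivity receive proportionally more noise, instead of every feature being drowned by the single largest sensitivity.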

from shfl.differential_privacy import GaussianMechanism

dpm = GaussianMechanism(sensitivity=sensitivities, epsilon_delta=(0.9, 0.9))
federated_government = FederatedGovernment(
    model_builder, federated_data, aggregator, model_params_access = dpm)
print("Test data size: " + str(test_data.shape[0]))
print("\n")
federated_government.run_rounds(n = 1, test_data = test_data, test_label = test_label)

# Reference centralized (non-federated) model:
print("Centralized model test performance : " + str(centralized_model.evaluate(data=test_data, labels=test_labels)))
plot_k_means(centralized_model, test_data, title = "Benchmark on Test data: K-means using CENTRALIZED data")
plot_k_means(federated_government.global_model, test_data, title = "Benchmark on Test data: K-means using FL and DP")
Test data size: 15


Accuracy round 0
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x14188e690>: (0.8347875051269761, 0.8687756288885651, 0.8514425151306293, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x141b648d0>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bb350>: (0.8514718749116273, 0.8514718749116273, 0.8514718749116273, 0.8748012718600954)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bb3d0>: (0.8347875051269761, 0.8687756288885651, 0.8514425151306293, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bb590>: (1.0, 1.0, 1.0, 1.0)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bf510>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bfb10>: (0.8514718749116273, 0.8514718749116273, 0.8514718749116273, 0.8748012718600954)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bff50>: (1.0, 1.0, 1.0, 1.0)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bfd10>: (0.7253812460663952, 0.830041729736677, 0.7741903180935481, 0.7585281717133001)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x1420bfcd0>: (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x141b8bf90>: (0.7478187423095356, 0.8557165845714431, 0.7981375767908175, 0.7987734764277501)
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x141b8b250>: (0.7777989646831382, 0.8094668170538627, 0.7933169850872153, 0.7938751472320377)
Global model test performance : (0.4473739749370908, 0.7096070168485128, 0.54877256242702, 0.5116279069767442)



Centralized model test performance : (0.8347875051269761, 0.8687756288885652, 0.8514425151306295, 0.8763250883392226)

(Figure: "Benchmark on Test data: K-means using CENTRALIZED data")

(Figure: "Benchmark on Test data: K-means using FL and DP")

As you can see, when we add DP to the model, it becomes quite unstable (multiple executions give very different results) and almost useless (even with unacceptable values of $\delta$, that is, $\delta \geq 0.5$, the results are quite bad), which suggests that another way of adding DP has to be provided. An alternative approach for adding DP can be found in A differential privacy protecting K-means clustering algorithm based on contour coefficients, but it is still unclear how to adapt it to a federated setting.