Sampling Methods

from shfl.data_base.emnist import Emnist

database = Emnist()
train_data, train_labels, test_data, test_labels = database.load_data()
print(len(train_data))
print(len(test_data))
print(type(train_data[0]))
train_data[0].shape
240000
40000
<class 'numpy.ndarray'>

(28, 28)
import matplotlib.pyplot as plt

plt.imshow(train_data[0])

import shfl
import numpy as np

class NumberOfInstances(shfl.private.FederatedTransformation):

    def apply(self, labeled_data):
        print(len(labeled_data.label))


class UniqueLabels(shfl.private.FederatedTransformation):

    def apply(self, labeled_data):
        classes = [label.argmax(-1) for label in labeled_data.label]
        print(np.unique(classes))

IID Federated Sampling

In the IID scenario, each node has independent and identically distributed access to all observations in the dataset.

The only available choices are:

  1. Percentage of the dataset used
  2. Number of instances per node
  3. Sampling with or without replacement

Percentage of the dataset used in an IID scenario

The percent parameter sets the percentage of the total number of observations in the dataset that is split across the clients. Because the subset is chosen at random, it is statistically representative and follows the same distribution as the whole dataset. The value must lie between 0 and 100.
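
As a rough illustration of what percent does, the following minimal sketch draws a random subset from a toy pool of labels and compares the class proportions with those of the full pool. It uses only the Python standard library. `take_percent` is a hypothetical helper, not part of shfl, whose internal implementation may differ.

```python
import random
from collections import Counter


def take_percent(pool, percent, seed=0):
    """Return a random subset holding `percent` % of `pool` (hypothetical helper)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    rng = random.Random(seed)
    size = int(len(pool) * percent / 100)
    # random.sample draws without repetition, so the result is a true subset.
    return rng.sample(pool, size)


# Toy pool: 10,000 labels with a 70/30 class split.
pool = [0] * 7000 + [1] * 3000
subset = take_percent(pool, percent=50)

print(len(subset))  # 5000
share = Counter(subset)[0] / len(subset)
print(round(share, 2))  # close to 0.70, up to sampling noise
```

The share of class 0 in the subset stays near 0.70 because every observation has the same chance of being selected.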

Number of instances per node in an IID scenario

The weights parameter sets, deterministically, the number of samples per node as a fraction of the total number of observations in the dataset used for the simulation. For instance, weights = [0.2, 0.3, 0.5] means that the first node is assigned 20% of the observations in the dataset used, the second node 30%, and the third node 50%.

Note that the weights values do not necessarily sum to one. Whether a sum above one is kept or normalized depends on the sampling option, as detailed in the sections on combining the weights and sampling parameters.
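
The arithmetic behind weights can be sketched as follows. `instances_per_node` is a hypothetical helper written for this illustration, and rounding to the nearest integer is an assumption, since the text does not say how shfl handles fractional counts.

```python
def instances_per_node(total, weights):
    """Number of observations each node receives for the given weights."""
    # Rounding to the nearest integer is an assumption made for this sketch.
    return [round(total * w) for w in weights]


total = 1000  # size of the (toy) dataset actually used
counts = instances_per_node(total, [0.2, 0.3, 0.5])
print(counts)  # [200, 300, 500]
```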

Sampling with or without replacement

The sampling parameter takes one of two values, 'with_replacement' or 'without_replacement'. It indicates whether an observation assigned to a node is removed from the dataset pool, and is therefore assigned only once (sampling = 'without_replacement'), or is returned to the pool and can be selected again for a new assignment (sampling = 'with_replacement').
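
The difference between the two modes can be seen on a toy pool using only the standard library. This is a generic sketch of the two sampling schemes and does not call shfl. `random.sample` draws without replacement and `random.choices` draws with replacement.

```python
import random

rng = random.Random(42)
pool = list(range(10))  # ten observation ids

# Without replacement: a drawn observation leaves the pool, so ids are unique.
without = rng.sample(pool, 8)
print(len(set(without)) == len(without))  # True

# Asking for more unique observations than the pool holds is impossible.
try:
    rng.sample(pool, 15)
except ValueError as err:
    print("cannot draw 15 unique ids from 10:", err)

# With replacement: each draw goes back to the pool, so ids can repeat
# and the sample may be larger than the pool itself.
oversized = rng.choices(pool, k=15)
print(len(oversized), len(set(oversized)))
```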

Combinations of the weights and sampling parameters

sampling = 'without_replacement'

When sampling = 'without_replacement', the total number of samples assigned to the nodes cannot exceed the number of available observations in the dataset. This constrains the weights parameter: the sum of the weights values must be less than or equal to one. If the sum is greater than one, the weights are normalized to sum to one.
The possible cases are:

  1. If the sum of the weights values is less than one when sampling = 'without_replacement', then the observations distributed to the nodes (the union of the nodes' sets of samples) form a proper subset of the portion of the raw dataset selected by percent.
from shfl.data_distribution.data_distribution_iid import IidDataDistribution

iid_distribution = IidDataDistribution(database)
federated_data, test_data, test_label = iid_distribution.get_federated_data(num_nodes=3, percent = 50,
                                                                            weights=[0.1,0.2,0.3])

print(type(federated_data))
print(federated_data.num_nodes())
print("Number of instances per client:")
shfl.private.federated_operation.apply_federated_transformation(federated_data, NumberOfInstances())
<class 'shfl.private.federated_operation.FederatedData'>
3
Number of instances per client:
12000
24000
36000
  2. If the sum of the weights values is equal to one when sampling = 'without_replacement', then the observations distributed to the nodes (the union of the nodes' sets of samples) are exactly the portion of the raw dataset selected by percent; that is, the distributed samples form a partition of that portion.
iid_distribution = IidDataDistribution(database)
federated_data, test_data, test_label = iid_distribution.get_federated_data(num_nodes=3, percent = 50,
                                                                            weights=[0.3,0.3,0.4])

print(type(federated_data))
print(federated_data.num_nodes())
print("Number of instances per client:")
shfl.private.federated_operation.apply_federated_transformation(federated_data, NumberOfInstances())
<class 'shfl.private.federated_operation.FederatedData'>
3
Number of instances per client:
36000
36000
48000
  3. If the sum of the weights values is greater than one when sampling = 'without_replacement', then the weights values are normalized to sum to one. For instance, with sampling = 'without_replacement' and weights = [0.2, 0.3, 0.7], the sum of the weights values is 1.2 > 1, so the effective weights result from the normalization: weights = [0.2/1.2, 0.3/1.2, 0.7/1.2].
iid_distribution = IidDataDistribution(database)
federated_data, test_data, test_label = iid_distribution.get_federated_data(num_nodes=3, percent = 50,
                                                                            weights=[0.2,0.3,0.7])

print(type(federated_data))
print(federated_data.num_nodes())
print("Number of instances per client:")
shfl.private.federated_operation.apply_federated_transformation(federated_data, NumberOfInstances())
<class 'shfl.private.federated_operation.FederatedData'>
3
Number of instances per client:
20000
30000
70000
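
The three cases for sampling = 'without_replacement' can be summarized in a small sketch. `effective_weights` and `counts_for` are hypothetical helpers, not shfl functions, and the rounding step is an assumption. The pool size of 120000 corresponds to percent = 50 applied to the 240000 training observations.

```python
def effective_weights(weights):
    """Rescale the weights to sum to one only when their sum exceeds one."""
    total = sum(weights)
    if total > 1:
        return [w / total for w in weights]
    return list(weights)


def counts_for(weights, pool_size):
    """Observations per node; rounding to the nearest integer is an assumption."""
    return [round(pool_size * w) for w in effective_weights(weights)]


pool_size = 120_000  # percent = 50 applied to the 240000 training observations
for weights in ([0.1, 0.2, 0.3], [0.3, 0.3, 0.4], [0.2, 0.3, 0.7]):
    print(weights, "->", counts_for(weights, pool_size))
```

For the three weight vectors used in this section the sketch yields 12000/24000/36000, 36000/36000/48000 and 20000/30000/70000, which agree with the counts printed by the examples.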

sampling = 'with_replacement'

When sampling = 'with_replacement', the total number of samples assigned to the nodes can be greater or less than the number of available observations in the dataset, so there is no constraint on the weights values. The resulting node samples are subsets of the original dataset that may share observations. In addition, each node may hold zero, one, or several copies of a given observation.

iid_distribution = IidDataDistribution(database)
federated_data, test_data, test_label = iid_distribution.get_federated_data(num_nodes=3, percent = 50,
                                                                            weights=[0.5,0.3,0.7],
                                                                            sampling = "with_replacement")

print(type(federated_data))
print(federated_data.num_nodes())
print("Number of instances per client:")
shfl.private.federated_operation.apply_federated_transformation(federated_data, NumberOfInstances())
<class 'shfl.private.federated_operation.FederatedData'>
3
Number of instances per client:
60000
36000
84000
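
To make the sharing and repetition of observations concrete, here is a toy sketch with a pool of 100 observation ids and the same weights as above. It uses `random.choices` from the standard library as a stand-in for sampling with replacement and is not the shfl implementation.

```python
import random
from collections import Counter

rng = random.Random(7)
pool = list(range(100))    # toy pool of 100 observation ids
weights = [0.5, 0.3, 0.7]  # sums to 1.5, which is allowed with replacement

# Each node draws its own sample; draws are returned to the pool every time.
nodes = [rng.choices(pool, k=round(len(pool) * w)) for w in weights]
sizes = [len(node) for node in nodes]
print(sizes)  # [50, 30, 70]

# Copies of observation 0 held by each node: zero, one, or more.
print([Counter(node)[0] for node in nodes])

# Observations held by both the first and the second node.
shared = set(nodes[0]) & set(nodes[1])
print(len(shared))
```

The nodes hold 150 samples in total although the pool has only 100 observations, so at least one observation must have been drawn more than once.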

Non-IID Federated Sampling

In contrast to the IID scenario, which has a single clear definition, data can be non-IID for several reasons:

  1. Non-identical client distributions: This is the case when data distributions from several clients do not follow the same probability distribution. The difference in probability distributions can be due to several factors:

    1.1. Feature distribution skew: When data features of several clients follow different probability distributions. This case is typical for personal data, such as handwritten digits.

    1.2. Label distribution skew: When label distribution varies across different clients. This kind of skew is typical for area-dependent data (species existing in a certain place).

    1.3. Concept shift: When the features of data with the same label differ across clients (same label, different features), e.g., due to cultural differences, or when the labels of data with the same features differ across clients (same features, different label), e.g., due to personal preferences.

    1.4. Unbalancedness: It is common for the amount of data to vary significantly between clients.

  2. Non-independent client distributions: When the distribution of data from some clients depends on the distribution of data from others. For example, cross-device FL experiments are performed at night, local time, which introduces a geographic bias in the data.
  3. Non-identical and non-independent distributions: In real FL scenarios, data may be non-IID for several reasons simultaneously, due to the particular nature of the data source.

As we have explained, the reasons why a dataset may be non-IID are manifold. At the moment, the framework implements label distribution skew. For each client, we randomly choose the number of labels it knows and which ones they are. We show the labels known by each client.
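
The idea of label distribution skew described above can be sketched in a few lines: each node receives a random, non-empty subset of the labels. `assign_known_labels` is a hypothetical helper. How shfl actually draws the number of labels per client is not stated here, so the uniform choice below is an assumption.

```python
import random


def assign_known_labels(num_nodes, labels, seed=0):
    """Give each node a random, non-empty subset of the available labels."""
    rng = random.Random(seed)
    known = []
    for _ in range(num_nodes):
        # Assumption: the number of known labels is drawn uniformly.
        how_many = rng.randint(1, len(labels))
        known.append(sorted(rng.sample(labels, how_many)))
    return known


labels = list(range(10))  # digit classes 0..9
known_per_node = assign_known_labels(3, labels)
for node, known in enumerate(known_per_node):
    print(node, known)
```

A node then only receives observations whose label belongs to its own list, which is what makes the client distributions differ.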

In this case, the available options are the same, and have the same meaning, as in IID sampling. When sampling = 'without_replacement', the non-IID restrictions (clients with a reduced number of known labels) can cause some clients to receive less data than specified by the weights parameter, because there is not enough data for a given label. This can also happen when sampling = 'with_replacement', but it is less likely because data from a label can be reused. It occurs only if the amount of data assigned to a client is greater than the total amount of data available for its labels.
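
One way to read the paragraph above is as a simple cap on what a client can receive. The sketch below encodes that reading with toy numbers. `received` and its parameters are hypothetical and written for illustration only. They are an interpretation of the text, not shfl code.

```python
def received(requested, label_total, already_taken, with_replacement):
    """Observations a client ends up with under the reading described above.

    requested:     amount implied by the client's weight
    label_total:   observations in the dataset carrying the client's known labels
    already_taken: how many of those were already given to other clients
    """
    if with_replacement:
        available = label_total  # data from the labels can be reused
    else:
        available = max(label_total - already_taken, 0)
    return min(requested, available)


# Toy numbers: 60000 observations carry the client's labels, 20000 already assigned.
print(received(48000, 60000, 20000, with_replacement=False))  # 40000
print(received(48000, 60000, 20000, with_replacement=True))   # 48000
print(received(72000, 60000, 20000, with_replacement=True))   # 60000
```

The last call shows the only situation in which a shortfall appears with replacement: the request exceeds all the data available for the client's labels.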

Here, we show how the amount of data held by each client differs between sampling without and with replacement:

from shfl.data_distribution.data_distribution_non_iid import NonIidDataDistribution

non_iid_distribution = NonIidDataDistribution(database)
federated_data, test_data, test_label = non_iid_distribution.get_federated_data(num_nodes=3, percent = 100,
                                                                            weights=[0.2,0.3,0.2])

print(type(federated_data))
print(federated_data.num_nodes())
print("Number of instances per client:")
shfl.private.federated_operation.apply_federated_transformation(federated_data, NumberOfInstances())
<class 'shfl.private.federated_operation.FederatedData'>
3
Number of instances per client:
48000
56118
27176
from shfl.data_distribution.data_distribution_non_iid import NonIidDataDistribution

non_iid_distribution = NonIidDataDistribution(database)
federated_data, test_data, test_label = non_iid_distribution.get_federated_data(num_nodes=3, percent = 100,
                                                                                weights=[0.2,0.3,0.2],
                                                                                sampling="with_replacement")

print(type(federated_data))
print(federated_data.num_nodes())
print("Number of instances per client:")
shfl.private.federated_operation.apply_federated_transformation(federated_data, NumberOfInstances())
<class 'shfl.private.federated_operation.FederatedData'>
3
Number of instances per client:
48000
72000
48000

Let's see the known labels for each client, in order to show the label distribution skew.

print("Known labels per client:")
shfl.private.federated_operation.apply_federated_transformation(federated_data, UniqueLabels())
Known labels per client:
[3 4 7]
[0 1 3 4 7 8 9]
[2 3]