Privacy Artificial Intelligence

Evaluation Methodology

In this notebook, we study the evaluation methodologies available for federated learning (FL) problems. First, we set up the FL configuration (for more information, see A Simple Experiment).

import shfl
import tensorflow as tf
import numpy as np
import random


class Reshape(shfl.private.FederatedTransformation):

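    def apply(self, labeled_data):
        # Add a trailing channel axis; assumes the samples are stored in `labeled_data.data`
        labeled_data.data = np.reshape(labeled_data.data, (labeled_data.data.shape[0], labeled_data.data.shape[1], labeled_data.data.shape[2], 1))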

class Normalize(shfl.private.FederatedTransformation):

    def __init__(self, mean, std):
        self.__mean = mean
        self.__std = std

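    def apply(self, labeled_data):
        # Standardize the samples; assumes they are stored in `labeled_data.data`
        labeled_data.data = (labeled_data.data - self.__mean)/self.__std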

def model_builder():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu', strides=1, input_shape=(28, 28, 1)))
    model.add(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='valid'))
    model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu', strides=1))
    model.add(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='valid'))
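    # Flatten the feature maps so the dense layers output one vector per sample
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(128, activation='relu'))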
    model.add(tf.keras.layers.Dense(64, activation='relu'))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))

    model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])

    return shfl.model.DeepLearningModel(model)

# Read data
database = shfl.data_base.Emnist()
train_data, train_labels, test_data, test_labels = database.load_data()

#Distribute among clients
non_iid_distribution = shfl.data_distribution.NonIidDataDistribution(database)
federated_data, test_data, test_labels = non_iid_distribution.get_federated_data(num_nodes=5, percent=10)

#Set up aggregation operator
aggregator = shfl.federated_aggregator.FedAvgAggregator()
federated_government = shfl.federated_government.FederatedGovernment(model_builder, federated_data, aggregator)

#Reshape and normalize
shfl.private.federated_operation.apply_federated_transformation(federated_data, Reshape())

# Assumption: the statistics are computed on the server-side test data
mean = np.mean(test_data)
std = np.std(test_data)
shfl.private.federated_operation.apply_federated_transformation(federated_data, Normalize(mean, std))
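
The aggregation operator configured above combines the clients' model parameters after each round. As a rough illustration of the idea, here is a minimal NumPy sketch of federated averaging, assuming every client has equal weight. The helper `fedavg` and the toy client weights are hypothetical and are not part of the shfl API, whose aggregator may differ in detail.

```python
import numpy as np

def fedavg(client_weights):
    """Average per-layer parameters across clients (equal weighting).

    client_weights: list with one entry per client; each entry is a list
    of numpy arrays (one array per layer), all with matching shapes.
    """
    n_layers = len(client_weights[0])
    return [
        np.mean([weights[layer] for weights in client_weights], axis=0)
        for layer in range(n_layers)
    ]

# Two hypothetical clients, each with a 2x2 kernel and a bias vector.
client_a = [np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([0.0, 1.0])]
client_b = [np.array([[3.0, 4.0], [5.0, 6.0]]), np.array([2.0, 3.0])]

global_weights = fedavg([client_a, client_b])
print(global_weights[0])  # element-wise mean of the two kernels
print(global_weights[1])  # element-wise mean of the two biases
```

Weighting each client by its number of samples is a common variant; the equal-weight form is used here only to keep the sketch short.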

Evaluation Methodology 1: Global Test Dataset

The first evaluation methodology is the federated version of classical evaluation: a common test dataset is held on the server. In each round of learning, we report the evaluation metrics (loss and accuracy, in this case) for both the local models and the updated global model. This methodology behaves as follows:

test_data = np.reshape(test_data, (test_data.shape[0], test_data.shape[1], test_data.shape[2],1))
federated_government.run_rounds(1, test_data, test_labels)
Accuracy round 0
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x144f05090>: [1054.660888671875, 0.19827499985694885]
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x144effed0>: [269.2410583496094, 0.5523999929428101]
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x108249150>: [684.6412963867188, 0.2964499890804291]
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x108249390>: [246.00254821777344, 0.5353999733924866]
Test performance client <shfl.private.federated_operation.FederatedDataNode object at 0x108224c50>: [142.7078857421875, 0.53267502784729]
Global model test performance : [46.26216125488281, 0.5289499759674072]

This methodology is the simplest and reports metrics for both the local and the global models. Its drawback is that the local evaluation metrics are biased by the distribution of the test set. That is, in a non-IID scenario (see Federated Sampling) the performance of the local models is not properly represented, because each client's training data follows a different distribution from the test data. For that reason, we propose the following evaluation methodology.

Evaluation Methodology 2: Global Test Dataset and Local Test Datasets

In this evaluation methodology, there is a global test dataset, and each client also has a local test dataset that follows the distribution of its own training data. In each round, we report the evaluation metrics of each client on both the global and the local test sets. This methodology is more complete, because it shows how the local FL models perform on both the global and the local data distributions.
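
To make the difference between the two test sets concrete, here is a toy sketch. It assumes a hypothetical client that has only ever seen two of the ten classes, and it replaces the trained network with a stand-in function (`toy_local_model`) that is perfect on those two classes and wrong on everything else. None of these names come from shfl.

```python
import numpy as np

NUM_CLASSES = 10
LOCAL_CLASSES = (0, 1)  # hypothetical: this client only ever saw two classes

def toy_local_model(labels):
    """Stand-in for a local model: perfect on the classes it was trained on,
    and forced to answer with a known class (here 0) for everything else."""
    return np.where(np.isin(labels, LOCAL_CLASSES), labels, LOCAL_CLASSES[0])

def accuracy(predictions, labels):
    return float(np.mean(predictions == labels))

# Global test set: 100 samples of each of the 10 classes.
global_labels = np.repeat(np.arange(NUM_CLASSES), 100)
# Local test set: same class mix as the client's training data.
local_labels = np.repeat(np.array(LOCAL_CLASSES), 100)

global_acc = accuracy(toy_local_model(global_labels), global_labels)
local_acc = accuracy(toy_local_model(local_labels), local_labels)
print(global_acc, local_acc)  # 0.2 1.0
```

The same model looks poor on the global test set and perfect on the local one, which is why reporting both numbers is more informative than reporting either alone.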

First, we split each client's data into train and test partitions. You can find this method in Federated Operation.


After that, each client owns a training set, which is used to train the local learning model, and a test set, which is used to evaluate it.
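
The per-client split can be pictured with plain NumPy. The sketch below is an illustration only: `split_client_data`, the 20% test fraction, and the synthetic clients are assumptions made for this example and are not the framework's own method.

```python
import numpy as np

def split_client_data(data, labels, test_fraction=0.2, seed=0):
    """Shuffle one client's samples and split them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    n_test = int(round(test_fraction * len(data)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (data[train_idx], labels[train_idx]), (data[test_idx], labels[test_idx])

# Hypothetical clients holding differently sized local datasets.
rng = np.random.default_rng(42)
clients = {
    "client_0": (rng.normal(size=(50, 28, 28, 1)), rng.integers(0, 2, size=50)),
    "client_1": (rng.normal(size=(80, 28, 28, 1)), rng.integers(0, 10, size=80)),
}

# Each client is split independently, so its test set keeps its own distribution.
splits = {name: split_client_data(x, y) for name, (x, y) in clients.items()}
for name, ((x_train, _), (x_test, _)) in splits.items():
    print(name, len(x_train), len(x_test))
```

Because the split happens inside each client, no sample leaves its node, and each local test set inherits whatever class imbalance the client's training data has.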

We are now ready to show the behaviour of this evaluation methodology.

# We restart federated government
federated_government = shfl.federated_government.FederatedGovernment(model_builder, federated_data, aggregator)

test_data = np.reshape(test_data, (test_data.shape[0], test_data.shape[1], test_data.shape[2],1))
federated_government.run_rounds(1, test_data, test_labels)
Accuracy round 0
Performance client <shfl.private.federated_operation.FederatedDataNode object at 0x144f05090>: Global test: [1101.955078125, 0.19705000519752502], Local test: [0.04107693210244179, 0.987500011920929]
Performance client <shfl.private.federated_operation.FederatedDataNode object at 0x144effed0>: Global test: [221.7073974609375, 0.5309000015258789], Local test: [0.25086769461631775, 0.9440104365348816]
Performance client <shfl.private.federated_operation.FederatedDataNode object at 0x108249150>: Global test: [587.9822998046875, 0.28905001282691956], Local test: [0.08768923580646515, 0.9674267172813416]
Performance client <shfl.private.federated_operation.FederatedDataNode object at 0x108249390>: Global test: [239.7751007080078, 0.5080249905586243], Local test: [0.3157331645488739, 0.8961303234100342]
Performance client <shfl.private.federated_operation.FederatedDataNode object at 0x108224c50>: Global test: [125.70452880859375, 0.5903000235557556], Local test: [0.7054284811019897, 0.8091602921485901]
Global model test performance : [57.04563903808594, 0.4162749946117401]

The output shows why this methodology matters. For example, the first client performed the worst on the global test but the best on the local test. This suggests that this client's data distribution is much narrower than the global one; for example, it may contain only two classes. Because that is a simpler problem, a single round of learning yields a very good local model, but one with very low performance on the global test.
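
The comparison can be checked directly from the reported numbers. The short sketch below copies the round-0 accuracies from the output above (rounded to five decimals) and finds the client with the worst global accuracy, the best local accuracy, and the largest gap between the two. The variable names are chosen for this illustration.

```python
import numpy as np

# Accuracies copied from the round-0 output above (one entry per client).
global_acc = np.array([0.19705, 0.53090, 0.28905, 0.50802, 0.59030])
local_acc = np.array([0.98750, 0.94401, 0.96743, 0.89613, 0.80916])

worst_global = int(np.argmin(global_acc))  # client with the lowest global accuracy
best_local = int(np.argmax(local_acc))     # client with the highest local accuracy
gap = local_acc - global_acc               # how much the local test flatters each client

print(worst_global, best_local)  # 0 0
print(int(np.argmax(gap)))       # 0
```

Index 0 is the first client in the output, so the same client is worst globally, best locally, and shows the largest local-global gap.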

This highlights the value of using FL-specific evaluation methodologies, especially when the data is distributed among clients in a non-IID way (see Sampling Methods).