Federated Learning tool: evaluation methodology

In this notebook, we study the evaluation methodologies available for horizontal Federated Learning (hFL) simulations. First, we set up the hFL configuration as in other experiments, and then we run one training for each evaluation methodology.

By evaluation methodologies, we mean the different ways of evaluating the resulting models with the Sherpa.ai Federated Learning tool.

import random
import numpy as np
import tensorflow as tf

import shfl
from shfl.auxiliar_functions_for_notebooks.functionsFL import *


def model_builder():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu', strides=1, input_shape=(28, 28, 1)))
    model.add(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='valid'))
    model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu', strides=1))
    model.add(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='valid'))
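    # Flatten the feature maps so the dense layers produce one 10-way prediction per image
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(128, activation='relu'))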
    model.add(tf.keras.layers.Dense(64, activation='relu'))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))

    loss = tf.keras.losses.CategoricalCrossentropy()
    optimizer = tf.keras.optimizers.RMSprop()
    metrics = [tf.keras.metrics.categorical_accuracy]
    return shfl.model.DeepLearningModel(model=model, loss=loss, optimizer=optimizer, metrics=metrics)

To simulate a real scenario, we distribute the EMNIST dataset among the clients in a non-IID way. We also create the elements of the federated training and apply the pertinent data transformations.

# Read the data
database = shfl.data_base.Emnist()
train_data, train_labels, test_data, test_labels = database.load_data()

# Distribute the data among the clients in a non-IID way
non_iid_distribution = shfl.data_distribution.NonIidDataDistribution(database)
nodes_federation, test_data, test_labels = non_iid_distribution.get_nodes_federation(num_nodes=5, percent=10)

# Set up the aggregation operator and the federated government
aggregator = shfl.federated_aggregator.FedAvgAggregator()
federated_government = shfl.federated_government.FederatedGovernment(model_builder(), nodes_federation, aggregator)

# Reshape and normalize
mean = np.mean(test_data.data)
std = np.std(test_data.data)
nodes_federation.apply_data_transformation(normalize_data, mean=mean, std=std);
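
As a rough illustration of what a federated averaging operator does conceptually, the sketch below averages the clients' parameters layer by layer with NumPy. It is a minimal sketch under stated assumptions, not the tool's implementation: the helper name `fedavg`, the unweighted mean, and the toy parameters are all introduced here for illustration.

```python
import numpy as np


def fedavg(clients_params):
    """Average model parameters across clients, layer by layer.

    clients_params: list with one entry per client; each entry is a list of
    np.ndarray (one array per layer). All clients share the same layer shapes.
    Hypothetical helper using an unweighted mean.
    """
    return [np.mean(np.stack(layer_group), axis=0)
            for layer_group in zip(*clients_params)]


# Two toy clients, each holding a weight matrix and a bias vector
client_a = [np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([0.0, 1.0])]
client_b = [np.array([[3.0, 4.0], [5.0, 6.0]]), np.array([2.0, 3.0])]

global_params = fedavg([client_a, client_b])
print(global_params[0])  # averaged weight matrix
print(global_params[1])  # averaged bias vector
```

A weighted variant, in which each client contributes in proportion to its number of training samples, would replace `np.mean` with `np.average` and a `weights` argument.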

In a real scenario, there is no global test dataset, as all the data stays local and the server does not own any data for privacy reasons. However, we use the capabilities of the experimental tool to show how the model behaves in a prototyping environment.

Evaluation methodology 1: global test dataset

The first evaluation methodology that we propose is the federated version of the classical evaluation methods. For this purpose, we use a common test dataset held on the server. In each round of learning, we show the evaluation metrics (loss and accuracy, in this case) for both the local models and the updated global model.

The behaviour of this evaluation methodology is as follows.

test_data = np.reshape(test_data, (test_data.shape[0], test_data.shape[1], test_data.shape[2],1))

federated_government.run_rounds(1, test_data, test_labels)
Evaluation in round 0:

Collaborative model test -> Loss: 59.04448699951172, Accuracy: 0.5504249930381775

This methodology is the simplest, as it shows only the result of the global model tested on the global test data. Its drawback is that the local evaluation metrics are biased by the distribution of the test data. That is, in a non-IID scenario the performance of the local models is not properly represented, because the distribution of each client's training data differs from that of the test data.

Nonetheless, it is a simple way to report results in IID experimental scenarios.
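
The mismatch between a client's data and the global test data can be made concrete by comparing label distributions. The following sketch is illustrative only: the labels are synthetic, and the helper names `label_distribution` and `total_variation` are hypothetical. It measures how far a label-skewed client and an IID client are from a balanced global test set.

```python
import numpy as np


def label_distribution(labels, num_classes):
    """Return the relative frequency of each class in `labels`."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts / counts.sum()


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()


num_classes = 10
# Synthetic global test set: every class equally represented
global_labels = np.repeat(np.arange(num_classes), 100)
# Synthetic non-IID client: only classes 0 and 1
skewed_client = np.repeat(np.array([0, 1]), 500)
# Synthetic IID client: same class proportions as the global test set
iid_client = np.repeat(np.arange(num_classes), 50)

p_global = label_distribution(global_labels, num_classes)
tv_skewed = total_variation(label_distribution(skewed_client, num_classes), p_global)
tv_iid = total_variation(label_distribution(iid_client, num_classes), p_global)
print(f"skewed client: {tv_skewed:.2f}, IID client: {tv_iid:.2f}")
```

A distance close to 0 suggests that global test metrics are a fair proxy for the client's own data, while a distance close to 1 suggests they say little about it.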

Evaluation methodology 2: global test dataset and local test datasets

In this evaluation methodology, we assume that there is a global test dataset and that each client also has a local test dataset that follows the distribution of its training data. Hence, in each round, we show the evaluation metrics of each client on both the global and the local test datasets.

This evaluation methodology is more complete, as it shows the performance of the local FL models on both the global and the local data distributions, which gives a better overview than the first methodology.

First, we split each client's data into training and test partitions using the split_train_test operation.


After that, each client owns a training set, which is used to train the local learning model, and a test set, which is used to evaluate it.
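
As a library-independent sketch of this idea, the code below splits each client's local data into training and test partitions with NumPy. The function `split_client_data`, the 80/20 ratio, the fixed seed, and the toy federation are assumptions made for illustration; they do not describe how split_train_test is implemented in the tool.

```python
import numpy as np


def split_client_data(data, labels, test_fraction=0.2, seed=0):
    """Shuffle one client's samples and split them into train/test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))
    n_test = int(round(test_fraction * len(data)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (data[train_idx], labels[train_idx]), (data[test_idx], labels[test_idx])


# Toy federation: three clients with different amounts of local data
clients = {
    "node_0": (np.arange(20).reshape(10, 2), np.zeros(10, dtype=int)),
    "node_1": (np.arange(40).reshape(20, 2), np.ones(20, dtype=int)),
    "node_2": (np.arange(10).reshape(5, 2), np.full(5, 2)),
}

# Each client is split independently, so every local test set keeps
# the distribution of that client's own data
splits = {name: split_client_data(x, y) for name, (x, y) in clients.items()}
for name, ((x_train, _), (x_test, _)) in splits.items():
    print(name, "train:", len(x_train), "test:", len(x_test))
```

Because the split happens inside each client, no sample leaves its node, which matches the privacy constraint of the federated setting.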

We are now ready to show the behaviour of this evaluation methodology.

# Restart the federated government with a freshly built model
federated_government = shfl.federated_government.FederatedGovernment(model_builder(), nodes_federation, aggregator)

federated_government.run_rounds(1, test_data, test_labels)
Evaluation in round 0:
Node 0:
 -> Global test:  Loss: 846.8062133789062, Accuracy: 0.1983249932527542
 -> Local test: Loss: 0.03912096098065376, Accuracy: 0.9905561208724976
Node 1:
 -> Global test:  Loss: 162.3595733642578, Accuracy: 0.7281000018119812
 -> Local test: Loss: 0.26336681842803955, Accuracy: 0.9364583492279053
Node 2:
 -> Global test:  Loss: 614.5960693359375, Accuracy: 0.3789750039577484
 -> Local test: Loss: 0.06318580359220505, Accuracy: 0.9760416746139526
Node 3:
 -> Global test:  Loss: 172.77056884765625, Accuracy: 0.7355250120162964
 -> Local test: Loss: 0.23929619789123535, Accuracy: 0.934374988079071
Node 4:
 -> Global test:  Loss: 97.10688018798828, Accuracy: 0.7725499868392944
 -> Local test: Loss: 0.353897362947464, Accuracy: 0.8864583373069763

Collaborative model test -> Loss: 56.06695556640625, Accuracy: 0.5565750002861023

The output shows the value of this evaluation methodology. For example, the first client (Node 0) performed the worst on the global test but the best on the local test. This indicates that this client's data distribution is most likely much narrower than the global one; for example, it may consist of only two classes. Because that is a simpler problem, the client obtains a very good local model after just one round of learning, but with very low performance on the global test.

This highlights the strength of using specific evaluation methodologies in FL, especially when the data is distributed among clients in a non-IID way.
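
The gap between local and global test performance can be reproduced on a toy problem. The sketch below is an illustration under stated assumptions, not the notebook's experiment: it replaces the convolutional network with a simple nearest-centroid classifier, uses synthetic 2-D data with four classes, and all helper names (`sample`, `fit_centroids`, `accuracy`) are hypothetical. The client only sees two of the four classes during training.

```python
import numpy as np

rng = np.random.default_rng(0)
# One well-separated cluster center per class
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])


def sample(classes, n_per_class):
    """Draw noisy 2-D points around the center of each requested class."""
    x = np.concatenate([centers[c] + rng.normal(scale=0.5, size=(n_per_class, 2))
                        for c in classes])
    y = np.repeat(np.array(classes), n_per_class)
    return x, y


def fit_centroids(x, y):
    """'Train' a nearest-centroid model: one mean vector per seen class."""
    seen = np.unique(y)
    return seen, np.stack([x[y == c].mean(axis=0) for c in seen])


def accuracy(model, x, y):
    """Fraction of points whose nearest centroid carries the true label."""
    seen, centroids = model
    distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(seen[distances.argmin(axis=1)] == y))


# The client only holds classes 0 and 1; the global test covers all four
x_train, y_train = sample([0, 1], 100)
x_local, y_local = sample([0, 1], 50)
x_global, y_global = sample([0, 1, 2, 3], 50)

local_model = fit_centroids(x_train, y_train)
local_acc = accuracy(local_model, x_local, y_local)
global_acc = accuracy(local_model, x_global, y_global)
print(f"local test accuracy: {local_acc:.2f}, global test accuracy: {global_acc:.2f}")
```

The local model can never predict the two classes it has not seen, so its accuracy on the balanced global test is capped at one half even though it solves its own local problem well.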