Federated Learning tool: evaluation methodology
In this notebook, we study the evaluation methodologies available for horizontal Federated Learning (hFL) simulations. First, we set up the hFL configuration as in other experiments; then we run one training for each evaluation methodology.
By evaluation methodology, we mean a way of evaluating the resulting models with the Sherpa.ai Federated Learning tool.
import random

import numpy as np
import tensorflow as tf

import shfl
from shfl.auxiliar_functions_for_notebooks.functionsFL import *

random.seed(123)
np.random.seed(seed=123)


def model_builder():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu',
                                     strides=1, input_shape=(28, 28, 1)))
    model.add(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='valid'))
    model.add(tf.keras.layers.Dropout(0.4))
    model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same', activation='relu', strides=1))
    model.add(tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding='valid'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(128, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.1))
    model.add(tf.keras.layers.Dense(64, activation='relu'))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))

    loss = tf.keras.losses.CategoricalCrossentropy()
    optimizer = tf.keras.optimizers.RMSprop()
    metrics = [tf.keras.metrics.categorical_accuracy]

    return shfl.model.DeepLearningModel(model=model, loss=loss, optimizer=optimizer, metrics=metrics)
To simulate a real scenario, we distribute the EMNIST dataset among the clients in a non-IID way. We also create the elements of the federated training and apply the required data transformations.
# Read data
database = shfl.data_base.Emnist()
train_data, train_labels, test_data, test_labels = database.load_data()

# Distribute among clients
non_iid_distribution = shfl.data_distribution.NonIidDataDistribution(database)
nodes_federation, test_data, test_labels = non_iid_distribution.get_nodes_federation(num_nodes=5, percent=10)

# Set up aggregation operator
aggregator = shfl.federated_aggregator.FedAvgAggregator()
federated_government = shfl.federated_government.FederatedGovernment(model_builder(), nodes_federation, aggregator)

# Reshape and normalize
nodes_federation.apply_data_transformation(reshape_data_tf)
mean = np.mean(test_data.data)
std = np.std(test_data.data)
nodes_federation.apply_data_transformation(normalize_data, mean=mean, std=std);
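One common way to simulate a non-IID split is to give each client samples from only a few label classes. The sketch below illustrates that idea with NumPy on synthetic labels. It is not the implementation behind `NonIidDataDistribution`; the function `non_iid_partition`, its parameters, and the sampling scheme are assumptions made for illustration.

```python
import numpy as np


def non_iid_partition(labels, num_nodes, classes_per_node, seed=0):
    """Assign sample indices to nodes so that each node only sees a few classes."""
    rng = np.random.default_rng(seed)
    all_classes = np.unique(labels)
    partitions = []
    for _ in range(num_nodes):
        # Each node draws its own subset of classes (hypothetical scheme).
        node_classes = rng.choice(all_classes, size=classes_per_node, replace=False)
        candidate_idx = np.flatnonzero(np.isin(labels, node_classes))
        # Keep a random half of the matching samples; different nodes may overlap.
        chosen = rng.choice(candidate_idx, size=len(candidate_idx) // 2, replace=False)
        partitions.append(np.sort(chosen))
    return partitions


# Synthetic labels for a 10-class problem, 100 samples per class.
labels = np.repeat(np.arange(10), 100)
partitions = non_iid_partition(labels, num_nodes=5, classes_per_node=3)
for node_id, idx in enumerate(partitions):
    print(f"Node {node_id}: classes {np.unique(labels[idx]).tolist()}, {len(idx)} samples")
```

Each node ends up with at most three of the ten classes, so its label distribution differs sharply from that of a test set covering all classes.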
In a real scenario, there is no global test dataset: all the data stays local, and the server does not own any data for privacy reasons. However, we use the capabilities of the experimental tool to show how the model behaves in a prototyping environment.
Evaluation methodology 1: global test dataset
The first evaluation methodology that we propose is the federated version of the classical evaluation methods. For this purpose, we use a common test dataset held on the server. In each round of learning, we show the evaluation metrics (loss and accuracy, in this case) for both the local models and the updated global model.
The behaviour of this evaluation methodology is as follows.
# Add a trailing channel axis so the test images match the model input shape
test_data = np.reshape(test_data, (test_data.shape[0], test_data.shape[1], test_data.shape[2], 1))
federated_government.run_rounds(1, test_data, test_labels)
Evaluation in round 0:
Collaborative model test -> Loss: 59.04448699951172, Accuracy: 0.5504249930381775
========================
This methodology is the simplest and shows only the result of the global model tested on the global test data. Its drawback is that the local evaluation metrics are biased by the distribution of the test set. That is, in a non-IID scenario the performance of the local models is not properly represented, because each client's training data follows a different distribution from the test data we use.
Nonetheless, it is a simple way to report results in IID experimental scenarios.
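The bias described above can be reproduced on a toy problem. The following is a minimal sketch, not the tool's evaluation code: it assumes synthetic Gaussian data, least-squares linear classifiers as stand-ins for the local models, and a FedAvg-style aggregation written by hand as a sample-weighted average of parameters. All names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, DIM = 3, 2
CENTERS = np.array([[0.0, 4.0], [4.0, 0.0], [-4.0, -4.0]])


def make_data(classes, samples_per_class):
    """Draw Gaussian blobs for the requested classes (synthetic stand-in data)."""
    x = np.vstack([CENTERS[c] + rng.normal(size=(samples_per_class, DIM)) for c in classes])
    y = np.repeat(classes, samples_per_class)
    return x, y


def fit_linear(x, y):
    """Least-squares fit of a linear classifier on one-hot targets (with bias)."""
    x_b = np.hstack([x, np.ones((len(x), 1))])
    return np.linalg.pinv(x_b) @ np.eye(NUM_CLASSES)[y]


def accuracy(weights, x, y):
    """Fraction of samples whose highest-scoring class matches the label."""
    x_b = np.hstack([x, np.ones((len(x), 1))])
    return float(np.mean(np.argmax(x_b @ weights, axis=1) == y))


# Three clients with skewed class coverage, and one test set held by the server.
client_classes = [[0, 1], [1, 2], [0, 2]]
client_data = [make_data(c, 50) for c in client_classes]
x_test, y_test = make_data([0, 1, 2], 100)

local_weights = [fit_linear(x, y) for x, y in client_data]
sizes = np.array([len(y) for _, y in client_data], dtype=float)
# FedAvg-style aggregation: average of parameters weighted by local sample count.
global_weights = np.tensordot(sizes / sizes.sum(), np.stack(local_weights), axes=1)

for node_id, w in enumerate(local_weights):
    print(f"Node {node_id} -> global test accuracy: {accuracy(w, x_test, y_test):.3f}")
print(f"Collaborative model -> global test accuracy: {accuracy(global_weights, x_test, y_test):.3f}")
```

In this sketch, each local model never sees one of the three classes and therefore never predicts it, so its accuracy on the balanced global test set cannot exceed 2/3, however well it fits its own data.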
Evaluation methodology 2: global test dataset and local test datasets
In this evaluation methodology, we assume that there is a global test dataset and that each client also has a local test dataset that follows the distribution of its training data. Hence, in each round, we show the evaluation metrics of each client on both the global and the local test sets.
This evaluation methodology is more complete, as it shows the performance of the local FL models on both the global and the local data distributions, which gives us a better overview than the first methodology.
First, we split each client's data into train and test partitions using the split_train_test operation.
After that, each client owns a training set, which is used to train the local learning model, and a test set, which is used to evaluate it.
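Conceptually, a per-client split shuffles that client's samples and holds out a fraction of them for testing. The sketch below shows the idea with NumPy for a single client. It does not reproduce the tool's split_train_test operation; the function `split_client_data` and the 80/20 ratio are assumptions.

```python
import numpy as np


def split_client_data(data, labels, train_fraction=0.8, seed=0):
    """Shuffle one client's samples and split them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    cut = int(round(train_fraction * len(data)))
    train_idx, test_idx = order[:cut], order[cut:]
    return (data[train_idx], labels[train_idx]), (data[test_idx], labels[test_idx])


# Hypothetical client holding 100 images that belong to classes 3 and 7 only.
client_data = np.random.default_rng(1).normal(size=(100, 28, 28))
client_labels = np.repeat([3, 7], 50)

(train_x, train_y), (test_x, test_y) = split_client_data(client_data, client_labels)
print(train_x.shape, test_x.shape)
print(sorted(set(test_y.tolist())))
```

Because the local test set is drawn from the client's own samples, it can only contain the classes that the client holds, which is what makes it representative of the local distribution.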
We are now ready to show the behaviour of this evaluation methodology.
# We restart federated government
federated_government = shfl.federated_government.FederatedGovernment(model_builder(), nodes_federation, aggregator)
federated_government.run_rounds(1, test_data, test_labels)
Evaluation in round 0:
Node 0:
 -> Global test: Loss: 846.8062133789062, Accuracy: 0.1983249932527542
 -> Local test: Loss: 0.03912096098065376, Accuracy: 0.9905561208724976
Node 1:
 -> Global test: Loss: 162.3595733642578, Accuracy: 0.7281000018119812
 -> Local test: Loss: 0.26336681842803955, Accuracy: 0.9364583492279053
Node 2:
 -> Global test: Loss: 614.5960693359375, Accuracy: 0.3789750039577484
 -> Local test: Loss: 0.06318580359220505, Accuracy: 0.9760416746139526
Node 3:
 -> Global test: Loss: 172.77056884765625, Accuracy: 0.7355250120162964
 -> Local test: Loss: 0.23929619789123535, Accuracy: 0.934374988079071
Node 4:
 -> Global test: Loss: 97.10688018798828, Accuracy: 0.7725499868392944
 -> Local test: Loss: 0.353897362947464, Accuracy: 0.8864583373069763
Collaborative model test -> Loss: 56.06695556640625, Accuracy: 0.5565750002861023
========================
The output shows the value of this evaluation methodology. For example, the first client (Node 0) performed the worst on the global test but the best on the local test. This suggests that its data distribution is very limited compared with the global one; for example, it may contain only two classes. Because that is a simpler problem, a single round of learning yields a very good local model, but one with very low performance on the global test.
This highlights the strength of using specific evaluation methodologies in FL, especially when the data is distributed among clients in a non-IID way.
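The gap between local and global test performance for a client that holds few classes can be illustrated on synthetic data. This is a sketch under stated assumptions, not a reproduction of the experiment above: the data are Gaussian blobs, the model is a simple nearest-centroid classifier, and the client is assumed to hold exactly two of ten classes.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 10


def sample(classes, per_class):
    """Synthetic data: class c is a Gaussian blob centred at 5 * e_c."""
    y = np.repeat(classes, per_class)
    x = 5.0 * np.eye(NUM_CLASSES)[y] + rng.normal(size=(len(y), NUM_CLASSES))
    return x, y


def fit_centroids(x, y):
    """Nearest-centroid model: one mean vector per class seen in training."""
    seen = np.unique(y)
    return seen, np.stack([x[y == c].mean(axis=0) for c in seen])


def accuracy(model, x, y):
    """Predict the class of the closest centroid and compare with the labels."""
    seen, centroids = model
    distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(seen[np.argmin(distances, axis=1)] == y))


# A client whose data covers only two of the ten classes.
x_train, y_train = sample([0, 1], 200)
x_local_test, y_local_test = sample([0, 1], 100)
x_global_test, y_global_test = sample(list(range(NUM_CLASSES)), 100)

model = fit_centroids(x_train, y_train)
local_acc = accuracy(model, x_local_test, y_local_test)
global_acc = accuracy(model, x_global_test, y_global_test)
print(f"Local test accuracy:  {local_acc:.3f}")
print(f"Global test accuracy: {global_acc:.3f}")
```

Since this model can only predict the two classes it has seen, its accuracy on a balanced ten-class test set is at most 0.2, regardless of how well it does locally.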