Federated models: TAN TM algorithm

In this notebook, we explain how to use a federated learning environment to train a TAN TM algorithm: a tree-augmented naive Bayes (TAN) classifier whose network parameters are learned discriminatively in order to maximize the conditional log-likelihood.
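For reference, the conditional log-likelihood mentioned above can be written as follows. The notation is an assumption made here for illustration and is not taken from the notebook: the training set contains N samples, each with features x^(n) and class c^(n), and θ denotes the network parameters.

```latex
\mathrm{CLL}(\theta)
  = \sum_{n=1}^{N} \log P\left(c^{(n)} \mid \mathbf{x}^{(n)}; \theta\right)
  = \sum_{n=1}^{N} \log
    \frac{P\left(c^{(n)}, \mathbf{x}^{(n)}; \theta\right)}
         {\sum_{c'} P\left(c', \mathbf{x}^{(n)}; \theta\right)}
```

Unlike the joint log-likelihood, this objective rewards only the probability assigned to the correct class given the features, which is why the learning is called discriminative.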

The data

First, we load the libraries and specify the global settings (here, a reproducibility seed) for the federated training environment and the model.

import shfl
import numpy as np
import pandas as pd
from shfl.model.ttm_bin_model import TTM01Algorithm
from shfl.auxiliar_functions_for_notebooks.functionsFL import *
from shfl.data_base.data_base import WrapLabeledDatabase
from shfl.federated_government.federated_government import FederatedGovernment
from shfl.private.federated_operation import ServerDataNode
from shfl.private.data import DPDataAccessDefinition
from shfl.private.reproducibility import Reproducibility

Reproducibility(567)

Now we load the data from a file containing binary features, with the label in the last column.

# Load data (the delimiter is detected from the file itself)
data_path = "data/database_cpf2_hard_noiid.csv"
df = pd.read_csv(data_path, sep=get_delimiter(data_path), header=None)

Let's take a look at the data available.

df.head()
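   0  1  2  3  4  5  6  7  8  9  ...  11  12  13  14  15  16  17  18  19  20
0  0  1  0  0  0  0  1  1  0  0  ...   0   1   1   1   1   0   0   0   0   1
1  0  1  0  0  1  0  1  1  0  0  ...   1   1   1   1   1   0   0   0   1   1
2  0  0  0  1  0  0  0  1  0  0  ...   0   0   1   1   1   0   0   0   0   1
3  1  1  1  0  1  1  0  1  0  0  ...   1   1   0   0   1   1   0   0   0   0
4  0  1  1  0  1  0  1  1  0  0  ...   1   1   1   1   0   1   1   0   1   0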

5 rows × 21 columns

For the TAN algorithm, we define the dependency graph that relates the features to one another. While Naïve Bayes assumes that the features are conditionally independent given the class, here we add custom dependencies between features, which relaxes that structural constraint.
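To make the difference concrete, here is a minimal sketch of how a TAN classifier scores a sample. The network (class C, features X1 and X2, with X2 depending on both C and X1) and all probability values are made up for illustration; this is not the `TTM01Algorithm` implementation.

```python
import numpy as np

# Toy TAN: X1 has parent C; X2 has parents C and X1 (hypothetical values).
p_c = np.array([0.6, 0.4])                      # P(C)
p_x1_given_c = np.array([[0.7, 0.3],            # P(X1 | C=0)
                         [0.2, 0.8]])           # P(X1 | C=1)
# P(X2 | C, X1), indexed as [c, x1, x2]
p_x2_given_c_x1 = np.array([[[0.9, 0.1], [0.4, 0.6]],
                            [[0.5, 0.5], [0.1, 0.9]]])


def tan_posterior(x1, x2):
    """Return P(C | x1, x2) under the toy TAN factorization."""
    # Joint P(C, x1, x2) = P(C) * P(x1 | C) * P(x2 | C, x1), for every class.
    joint = p_c * p_x1_given_c[:, x1] * p_x2_given_c_x1[:, x1, x2]
    # Normalize over the classes to obtain the posterior.
    return joint / joint.sum()


posterior = tan_posterior(1, 1)
predicted_class = int(np.argmax(posterior))
```

In a Naïve Bayes model the last factor would be P(x2 | C) only; the extra index on `x1` is what the additional edge in the dependency graph buys.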

TAN structure

Let's load the structure we want to use:

structure = pd.read_csv("data/structure.csv",
                        sep = get_delimiter("data/structure.csv"),
                        names=["Pa1", "Pa2"],
                        header=None)

The structure is specified in two columns: Pa1 gives the first parent, which is the class variable (the label), and Pa2 gives the second parent, the feature node on which the variable depends in the graph:

structure.head()
  Pa1  Pa2
1   C
2   C   X1
3   C   X8
4   C   X3
5   C   X6
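As an illustration of how such a table can be read, the sketch below turns it into a mapping from each node to its list of parents. It rebuilds the first rows by hand instead of reading the CSV, and it assumes that row index i refers to feature Xi; that naming is a guess made for this example, not something the notebook states.

```python
import pandas as pd

# Hand-built copy of the first rows shown above (assumed layout).
toy_structure = pd.DataFrame(
    {"Pa1": ["C", "C", "C"], "Pa2": [None, "X1", "X8"]},
    index=[1, 2, 3],
)

# Map each node to its parents, skipping empty cells.
# Assumption: row index i corresponds to feature "X{i}".
parents = {
    f"X{i}": [p for p in row if pd.notna(p)]
    for i, row in zip(toy_structure.index,
                      toy_structure.itertuples(index=False))
}
```

Under that assumption, a row with an empty Pa2 describes a feature whose only parent is the class.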

Before operating on the dataset, we extract the number of features in order to later configure the model.

n_features = len(df.columns) - 1

Now, to simulate a federated scenario, we need to distribute the data among the nodes. We use the WrapLabeledDatabase class to wrap the data and the labels in a format compatible with the tool. This class also lets us divide the global data into global train and test sets for experimentation purposes.

df = df.to_numpy()
grouped_data = np.delete(df, -1, axis=1)
grouped_labels = df[:, -1]

database = WrapLabeledDatabase(grouped_data,grouped_labels)
_, _, test_data, test_labels = database.load_data()
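The feature/label separation above relies on the label being the last column. A small sketch on a made-up array (not the notebook's data) shows what the two expressions return, and that `np.delete(..., -1, axis=1)` matches the more common slice `[:, :-1]`:

```python
import numpy as np

# Made-up table: three binary features followed by the label column.
toy = np.array([[0, 1, 0, 1],
                [1, 1, 0, 0],
                [0, 0, 1, 1]])

toy_features = np.delete(toy, -1, axis=1)   # drop the last column
toy_labels = toy[:, -1]                     # keep only the last column

# Slicing gives the same feature matrix as np.delete.
same_as_slice = np.array_equal(toy_features, toy[:, :-1])
```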

To finish the distribution, we divide the data among the nodes in an IID fashion and apply a local train/test split on each node, so that every node can evaluate the model it is training on its own data.

iid_distribution = shfl.data_distribution.IidDataDistribution(database)
nodes_federation, test_data, test_labels = iid_distribution.get_nodes_federation(num_nodes=3, percent=100)
nodes_federation.split_train_test(0.7);
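Conceptually, an IID distribution shuffles the samples and deals them out evenly, after which each node holds back part of its share for local evaluation. The sketch below reproduces that idea with plain NumPy on random data; the helper names and the `test_fraction` value are hypothetical, and this is not how `IidDataDistribution` or `split_train_test` are implemented.

```python
import numpy as np

rng = np.random.default_rng(567)
toy_data = rng.integers(0, 2, size=(30, 4))     # made-up binary features
toy_labels = rng.integers(0, 2, size=30)        # made-up binary labels


def iid_partition(data, labels, num_nodes, rng):
    """Shuffle the sample indices and split them evenly among the nodes."""
    order = rng.permutation(len(data))
    return [(data[idx], labels[idx])
            for idx in np.array_split(order, num_nodes)]


def local_split(node_data, node_labels, test_fraction):
    """Hold back the first test_fraction of a node's samples for testing."""
    n_test = int(round(len(node_data) * test_fraction))
    train = (node_data[n_test:], node_labels[n_test:])
    test = (node_data[:n_test], node_labels[:n_test])
    return train, test


nodes = iid_partition(toy_data, toy_labels, num_nodes=3, rng=rng)
splits = [local_split(d, l, test_fraction=0.3) for d, l in nodes]
```

Because the indices are shuffled before being split, each node receives a random sample of the whole dataset, which is what makes the partition IID.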

The model

Next, we define the model_builder() function to create an instance of the TAN TM algorithm. With the implementation below, we pass the number of features of the dataset, calculated previously, together with the structure loaded above and the t_max setting:

def model_builder():
    model = TTM01Algorithm(n_features, t_max=1, structure=structure)
    return model

Run the federated learning experiment

After defining the data and the model, we are ready to run our model in a federated configuration. Let's define the needed components and run the training.

aggregator = shfl.federated_aggregator.FedAvgAggregator()

federated_government = FederatedGovernment(model_builder(), nodes_federation, aggregator)
federated_government.run_rounds(n_rounds=2, test_data=test_data, test_label=test_labels)
Evaluation in round 0:
########################################
Node 0:
 -> Global test accuracy:0.8815968230745115
 -> Local accuracy:0.872893533638145
Node 1:
 -> Global test accuracy:0.8801860173476852
 -> Local accuracy:0.8798171129980406
Node 2:
 -> Global test accuracy:0.8841049221444247
 -> Local accuracy:0.8786414108425865
########################################

Collaborative model test ->  0.8818580833942941


Evaluation in round 1:
########################################
Node 0:
 -> Global test accuracy:0.8810220503709897
 -> Local accuracy:0.8731548007838015
Node 1:
 -> Global test accuracy:0.8808652941791201
 -> Local accuracy:0.8802090137165252
Node 2:
 -> Global test accuracy:0.8871355418539032
 -> Local accuracy:0.882560418027433
########################################

Collaborative model test ->  0.8826941164175985
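The collaborative model reported after each round comes from aggregating the parameters trained on the nodes. As a rough sketch of federated averaging, the function below takes an unweighted, element-wise mean of made-up parameter arrays from three nodes. It illustrates the general technique only and is not the code of `FedAvgAggregator`.

```python
import numpy as np


def fed_avg(node_params):
    """Element-wise, unweighted mean of the parameter arrays sent by the nodes."""
    # zip(*node_params) groups the i-th array of every node together.
    return [np.mean(np.stack(group), axis=0) for group in zip(*node_params)]


# Made-up parameters: each node sends two arrays of matching shapes.
node_params = [
    [np.array([0.2, 0.8]), np.array([[0.6, 0.4], [0.1, 0.9]])],
    [np.array([0.4, 0.6]), np.array([[0.5, 0.5], [0.3, 0.7]])],
    [np.array([0.3, 0.7]), np.array([[0.7, 0.3], [0.2, 0.8]])],
]

global_params = fed_avg(node_params)
```

An unweighted mean treats every node equally, which is a reasonable simplification when the nodes hold similar amounts of data, as in the IID setup used here.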