Vertical Federated Learning with Sherpa.ai platform using PSI
Federated Learning is a Machine Learning paradigm aimed at learning models from decentralized data, such as data located on users’ smartphones, in hospitals, or banks, and ensuring data privacy. This is achieved by training the model locally in each node (e.g., on each smartphone, at each hospital, or at each bank), sharing the model-updated local parameters (not the data) and securely aggregating them to build a better global model.
Traditional Machine Learning requires all the data to be gathered in one single place. In practice, this is often forbidden by privacy regulations. For this reason, Federated Learning is introduced, the goal being to learn from a large amount of data, while preserving privacy.
The supported Federated Learning categories are: Horizontal, Vertical and Transfer. In this notebook, we will train with Vertical Federated Learning (VFL), where the nodes share overlapping samples (the same sample ID space) but differ in data features. VFL exploits this heterogeneity to train a more accurate model. The main idea is to split a neural network among the different parties and a server.
VFL requires the nodes to share overlapping samples, i.e. the two clients must have the same samples in the same order (the first row of the first client must match the first row of the second client, and so on). In practice, this assumption rarely holds. To relax it, we perform Private Set Intersection (PSI) or Privacy Preserving Entity Resolution (PPER).
In this procedure, encrypted identifiers (e.g. email, card ID) are shared between the parties to enable each company to link the customers they have in common before training the local models. As only encrypted identifiers are shared, both the customers' identifiers and each company's other private features (e.g. age, salary, postal code) are kept safe. The identifier can be a composition of two or more variables with a certain transformation, which can add security, since the raw identifier is not shared.
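As an illustrative sketch (not the platform's actual scheme), a composite identifier can be derived by concatenating two fields and hashing the result, so that equal raw values map to equal digests without being exchanged in the clear:

```python
import hashlib

def composite_identifier(field_a: str, field_b: str) -> str:
    """Derive a shared pseudonymous identifier from two raw fields.

    Both parties apply the same deterministic transformation, so equal
    (field_a, field_b) pairs produce equal digests without revealing
    the raw values to the other party.
    """
    raw = f"{field_a}|{field_b}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

# The same customer produces the same digest on both sides.
digest_insurance = composite_identifier("alice@example.com", "ID-4711")
digest_telco = composite_identifier("alice@example.com", "ID-4711")
assert digest_insurance == digest_telco
```

Note that a plain hash of low-entropy identifiers is still vulnerable to dictionary attacks, which is why real deployments use keyed or encrypted variants.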
Nevertheless, Company A may learn which clients it has in common with Company B, and vice versa. Technically speaking, PSI makes the sample identifiers of the intersection visible to all parties, so each party learns that the entities in the intersection also appear in the other parties' datasets. Nonetheless, neither company can obtain the other company's data; the information about the clients stays safe.
On the one hand, there are cases where this membership information leakage is allowed between the companies. On the other hand, in some cases this membership information is sensitive and must be protected because of privacy standards and regulations.
In our case, we assume that the insurance company and the telco company have agreed to reveal the sample IDs of the intersection.
A telco and an insurance company want to share information while complying with privacy regulations. In this notebook we will simulate a fictional scenario where an insurance company and a telco collaborate, using Sherpa.ai's platform in a private way, to train a model that predicts whether a person will take out a life insurance policy. The insurance company has poor data that does not allow it to make good predictions; with the telco's knowledge, the insurance company will improve its predictions. So we are trying to predict both the probability of death and of purchase, using private client data from the telco and the insurance company, without the companies sharing each other's data. The general description of the problem is:
Once we have a general overview of the problem, the procedure is the following:
Index
As we can see, the preprocessing in this case is done before we do the PSI.
0) Libraries and data
import warnings
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
from shfl.auxiliar_functions_for_notebooks import intersection_federated_government
from shfl.auxiliar_functions_for_notebooks.functionsFL import *
from shfl.auxiliar_functions_for_notebooks.node_initialization import nodes_list, nodes_federation
from shfl.auxiliar_functions_for_notebooks.preprocessing import *
from shfl.data_base.data_base import split_train_test
from shfl.federated_aggregator import FedSumAggregator
from shfl.federated_government.vertical_federated_government import VerticalFederatedGovernment
from shfl.model.vertical_deep_learning_model_tf import VerticalNeuralNetClientModelTensorFlow
from shfl.model.vertical_deep_learning_model_tf import VerticalNeuralNetServerModelTensorFlow
from shfl.private.reproducibility import Reproducibility
from tensorflow import keras
plt.style.use('seaborn')
pd.set_option("display.max_rows", 30, "display.max_columns", None)
warnings.filterwarnings('ignore')
Reproducibility(567)
2022-05-10 18:03:40.341207: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-05-10 18:03:40.341228: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
/home/f.palomino/Desktop/venvpruebas/environment/lib/python3.8/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
1 nodes_list= <shfl.private.federated_operation.HeterogeneousDataNode object at 0x7f70e744fb50>
1 nodes_list= <shfl.private.federated_operation.HeterogeneousDataNode object at 0x7f70e73fe730>
1 nodes_list= <shfl.private.federated_operation.VerticalServerDataNode object at 0x7f70e744fd90>
<shfl.private.reproducibility.Reproducibility at 0x7f71e4414130>
We load the data of the insurance company and the telco company:
insurance = pd.read_csv("./data_from_insurance.csv", sep=",")
telco = pd.read_csv("./data_from_telco.csv", sep=",")
- The data from the insurance company:
insurance.head()
ID_1 | ID_2 | COMUNIDAD | SEXO | EDUCATION | TOMA_MEDIC | EDAD | IMC | ESTANCIA_HOSPITAL | VIDA | AUTO | HOGAR | TARGET | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6078 | 6078 | 100000 | 2 | 2 | 2 | 26 | 0 | 0 | 0 | -1 | 0 | 0 |
1 | 22466 | 22466 | 500000 | 2 | 3 | 2 | 44 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 603 | 603 | 170000 | 1 | 1 | 1 | 53 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 14730 | 14730 | 200000 | 1 | 2 | 1 | 40 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 16210 | 16210 | 20000 | 2 | 2 | 1 | 46 | 2 | 0 | 0 | 2 | 2 | 1 |
- The data from the telco company:
telco.head()
ID_1 | ID_2 | TV | N_PRODUCTOS | NIVEL_PASOS | SANA | INFANTIL | FILTRADO | FACTURA | CAMBIO_DIRECCION | REDUCCION_LINEAS | AX_TIPO_FAMILIA | AX_MIEMBROS_FAMILIA | SEGURO_MOVIL | BROWSE_CATGRP.1 | BROWSE_CATGRP.2 | PERMISOS_PREF | PERMISOS_CORREO | PERMISOS_EMAIL | PERMISOS_DATOS | PERMISOS_TERCEROS | PERMISOS_GUIA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20288 | 20288 | -2 | -2 | -1 | -1 | -1 | -99 | 0 | 0 | 1350 | 0 | 1265 | 0 | 85 | 58 | 0 | 1350 | 0 | 1265 | 0 | 0 |
1 | 14918 | 14918 | 1 | -2 | -2 | -2 | -2 | 124 | 0 | 0 | 0 | 0 | 0 | 0 | 60 | -15 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 14834 | 14834 | 2 | 0 | 0 | 0 | 0 | 16 | 46868 | 42397 | 38569 | 33570 | 19895 | 30125 | 58 | -56 | 1800 | 2200 | 1400 | 2000 | 30000 | 997 |
3 | 27974 | 27974 | -1 | -1 | -1 | -1 | -1 | 17 | 2848 | 1469 | 1647 | 1355 | 1458 | 1631 | -86 | -90 | 1469 | 1675 | 1355 | 1458 | 1631 | 1931 |
4 | 28217 | 28217 | 0 | 0 | 0 | 0 | 0 | -124 | 69817 | 71272 | 76740 | 67480 | 68843 | 70266 | -127 | -58 | 2049 | 6104 | 1839 | 1862 | 1918 | 2149 |
1) Prepare the data before doing the vertical federated learning scenario with PSI
In this notebook, we assume that both the insurance company and the telco company preprocess their data before using the Sherpa.ai Federated Learning framework. We proceed as follows:
print('There are {} observations with {} features from the insurance company.'.format(insurance.shape[0], insurance.shape[1]))
print('There are {} observations with {} features from the telco company.'.format(telco.shape[0], telco.shape[1]))
There are 26100 observations with 13 features from the insurance company.
There are 26100 observations with 22 features from the telco company.
1.1) Preprocessing of the insurance company
The insurance company selects its categorical and numerical variables:
categorical_insurance = ["COMUNIDAD", "SEXO", "EDUCATION", "TOMA_MEDIC", "IMC", "ESTANCIA_HOSPITAL", "VIDA", "AUTO", "HOGAR"]
numerical_insurance = ["EDAD"]
The variable to predict is TARGET:
target = insurance.TARGET
labels = target.to_numpy().reshape(-1,1)
We do not want to alter the ID, since we need it to perform the PSI. The insurance company also one-hot encodes its categorical variables and normalizes its numerical variables.
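one_hot_encode and normalize_data come from Sherpa.ai's auxiliary preprocessing module; a rough pandas sketch of what we assume they do (one dummy column per category value, z-score scaling per numerical column) is:

```python
import pandas as pd

def one_hot_encode_sketch(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    # One dummy column per category value, e.g. SEXO -> SEXO_1, SEXO_2.
    return pd.get_dummies(df, columns=columns, prefix=columns)

def normalize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    # z-score standardization: zero mean, unit variance per column.
    return (df - df.mean()) / df.std()

demo = pd.DataFrame({"SEXO": [1, 2, 2], "EDAD": [26, 44, 53]})
encoded = one_hot_encode_sketch(demo[["SEXO"]], ["SEXO"])
scaled = normalize_sketch(demo[["EDAD"]])
```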
cat_insurance = one_hot_encode(insurance[categorical_insurance], categorical_insurance)
num_insurance = normalize_data(insurance[numerical_insurance])
insurance_without_id = pd.concat([cat_insurance, num_insurance], axis=1)
id1_insurance = insurance.ID_1
id2_insurance = insurance.ID_2
insurance = pd.concat([pd.DataFrame(id1_insurance), pd.DataFrame(id2_insurance)], axis=1)
insurance = pd.concat([insurance, pd.DataFrame(insurance_without_id)], axis=1)
insurance = pd.concat([insurance, target], axis=1)
del categorical_insurance, numerical_insurance, cat_insurance, num_insurance, insurance_without_id, id1_insurance, id2_insurance
Finally, the data from the insurance company is:
insurance.head()
ID_1 | ID_2 | COMUNIDAD_16000 | COMUNIDAD_20000 | COMUNIDAD_30000 | COMUNIDAD_40000 | COMUNIDAD_50000 | COMUNIDAD_60000 | COMUNIDAD_70000 | COMUNIDAD_80000 | COMUNIDAD_90000 | COMUNIDAD_100000 | COMUNIDAD_110000 | COMUNIDAD_120000 | COMUNIDAD_130000 | COMUNIDAD_140000 | COMUNIDAD_150000 | COMUNIDAD_160000 | COMUNIDAD_170000 | COMUNIDAD_180000 | COMUNIDAD_190000 | COMUNIDAD_200000 | COMUNIDAD_210000 | COMUNIDAD_220000 | COMUNIDAD_230000 | COMUNIDAD_240000 | COMUNIDAD_250000 | COMUNIDAD_260000 | COMUNIDAD_270000 | COMUNIDAD_280000 | COMUNIDAD_290000 | COMUNIDAD_300000 | COMUNIDAD_310000 | COMUNIDAD_320000 | COMUNIDAD_327680 | COMUNIDAD_330000 | COMUNIDAD_340000 | COMUNIDAD_350000 | COMUNIDAD_360000 | COMUNIDAD_370000 | COMUNIDAD_380000 | COMUNIDAD_390000 | COMUNIDAD_400000 | COMUNIDAD_410000 | COMUNIDAD_420000 | COMUNIDAD_430000 | COMUNIDAD_440000 | COMUNIDAD_450000 | COMUNIDAD_460000 | COMUNIDAD_470000 | COMUNIDAD_480000 | COMUNIDAD_490000 | COMUNIDAD_500000 | COMUNIDAD_510000 | COMUNIDAD_520000 | COMUNIDAD_530000 | COMUNIDAD_540000 | COMUNIDAD_550000 | COMUNIDAD_560000 | COMUNIDAD_570000 | COMUNIDAD_580000 | COMUNIDAD_590000 | COMUNIDAD_600000 | COMUNIDAD_610000 | COMUNIDAD_620000 | COMUNIDAD_630000 | COMUNIDAD_640000 | COMUNIDAD_650000 | COMUNIDAD_660000 | COMUNIDAD_670000 | COMUNIDAD_680000 | COMUNIDAD_690000 | COMUNIDAD_700000 | COMUNIDAD_710000 | COMUNIDAD_720000 | COMUNIDAD_730000 | COMUNIDAD_740000 | COMUNIDAD_750000 | COMUNIDAD_780000 | COMUNIDAD_800000 | COMUNIDAD_1000000 | SEXO_2 | EDUCATION_1 | EDUCATION_2 | EDUCATION_3 | EDUCATION_4 | EDUCATION_5 | EDUCATION_6 | TOMA_MEDIC_1 | TOMA_MEDIC_2 | TOMA_MEDIC_3 | IMC_-1 | IMC_0 | IMC_1 | IMC_2 | IMC_3 | IMC_4 | IMC_5 | IMC_6 | IMC_7 | IMC_8 | ESTANCIA_HOSPITAL_-1 | ESTANCIA_HOSPITAL_0 | ESTANCIA_HOSPITAL_1 | ESTANCIA_HOSPITAL_2 | ESTANCIA_HOSPITAL_3 | ESTANCIA_HOSPITAL_4 | ESTANCIA_HOSPITAL_5 | ESTANCIA_HOSPITAL_6 | ESTANCIA_HOSPITAL_7 | ESTANCIA_HOSPITAL_8 | VIDA_-1 | VIDA_0 | VIDA_1 | VIDA_2 | VIDA_3 | VIDA_4 
| VIDA_5 | VIDA_6 | VIDA_7 | VIDA_8 | AUTO_-1 | AUTO_0 | AUTO_1 | AUTO_2 | AUTO_3 | AUTO_4 | AUTO_5 | AUTO_6 | AUTO_7 | AUTO_8 | HOGAR_-1 | HOGAR_0 | HOGAR_2 | HOGAR_3 | HOGAR_4 | HOGAR_5 | HOGAR_6 | HOGAR_7 | HOGAR_8 | EDAD | TARGET | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6078 | 6078 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1.030351 | 0 |
1 | 22466 | 22466 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.919181 | 0 |
2 | 603 | 603 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.893947 | 0 |
3 | 14730 | 14730 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.485951 | 0 |
4 | 16210 | 16210 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1.135796 | 1 |
The server in this case corresponds to the insurance company, since it is the one that has the labels:
data_server = insurance[["ID_1", "ID_2", "TARGET"]]
1.2) Preprocessing of the telco company
The telco company follows the same process:
id1_telco = telco.ID_1
id2_telco = telco.ID_2
categorical_telco = ["TV", "N_PRODUCTOS", "NIVEL_PASOS", "SANA", "INFANTIL"]
numerical_telco = ['FILTRADO','FACTURA', 'CAMBIO_DIRECCION', 'REDUCCION_LINEAS', 'AX_TIPO_FAMILIA', 'AX_MIEMBROS_FAMILIA', 'SEGURO_MOVIL',
'BROWSE_CATGRP.1', 'BROWSE_CATGRP.2', 'PERMISOS_PREF', 'PERMISOS_CORREO', 'PERMISOS_EMAIL', 'PERMISOS_DATOS',
'PERMISOS_TERCEROS', 'PERMISOS_GUIA']
cat_telco = one_hot_encode(telco[categorical_telco], categorical_telco)
num_telco = normalize_data(telco[numerical_telco])
telco_without_id = pd.concat([cat_telco, num_telco], axis=1)
telco = pd.concat([pd.DataFrame(id1_telco), pd.DataFrame(id2_telco)], axis=1)
telco = pd.concat([telco, telco_without_id], axis=1)
telco.head()
ID_1 | ID_2 | TV_-1 | TV_0 | TV_1 | TV_2 | TV_3 | TV_4 | TV_5 | TV_6 | TV_7 | TV_8 | N_PRODUCTOS_-1 | N_PRODUCTOS_0 | N_PRODUCTOS_1 | N_PRODUCTOS_2 | N_PRODUCTOS_3 | N_PRODUCTOS_4 | N_PRODUCTOS_5 | N_PRODUCTOS_6 | N_PRODUCTOS_7 | N_PRODUCTOS_8 | NIVEL_PASOS_-1 | NIVEL_PASOS_0 | NIVEL_PASOS_1 | NIVEL_PASOS_2 | NIVEL_PASOS_3 | NIVEL_PASOS_4 | NIVEL_PASOS_5 | NIVEL_PASOS_6 | NIVEL_PASOS_7 | NIVEL_PASOS_8 | SANA_-1 | SANA_0 | SANA_1 | SANA_2 | SANA_3 | SANA_4 | SANA_5 | SANA_6 | SANA_7 | SANA_8 | INFANTIL_-1 | INFANTIL_0 | INFANTIL_2 | INFANTIL_3 | INFANTIL_4 | INFANTIL_5 | INFANTIL_6 | INFANTIL_7 | INFANTIL_8 | FILTRADO | FACTURA | CAMBIO_DIRECCION | REDUCCION_LINEAS | AX_TIPO_FAMILIA | AX_MIEMBROS_FAMILIA | SEGURO_MOVIL | BROWSE_CATGRP.1 | BROWSE_CATGRP.2 | PERMISOS_PREF | PERMISOS_CORREO | PERMISOS_EMAIL | PERMISOS_DATOS | PERMISOS_TERCEROS | PERMISOS_GUIA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20288 | 20288 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1.313959 | -0.693231 | -0.689296 | -0.663520 | -0.672619 | -0.643676 | -0.650920 | 1.452346 | 1.053028 | -0.343620 | -0.224189 | -0.292022 | -0.227232 | -0.314737 | -0.293528 |
1 | 14918 | 14918 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.036227 | -0.693231 | -0.689296 | -0.683218 | -0.672619 | -0.664557 | -0.650920 | 1.076008 | -0.046004 | -0.343620 | -0.292072 | -0.292022 | -0.306592 | -0.314737 | -0.293528 |
2 | 14834 | 14834 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.413715 | -0.058300 | -0.093910 | -0.120454 | -0.149258 | -0.336151 | -0.145739 | 1.045901 | -0.663269 | -0.234100 | -0.181448 | -0.213920 | -0.181121 | 1.638393 | -0.237855 |
3 | 27974 | 27974 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.428738 | -0.654648 | -0.668666 | -0.659186 | -0.651495 | -0.640490 | -0.623569 | -1.121806 | -1.175147 | -0.254239 | -0.207847 | -0.216431 | -0.215124 | -0.208552 | -0.185699 |
4 | 28217 | 28217 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1.689541 | 0.252595 | 0.311585 | 0.436503 | 0.379404 | 0.471832 | 0.527405 | -1.739000 | -0.693379 | -0.218949 | 0.014860 | -0.189430 | -0.189779 | -0.189867 | -0.173526 |
del id1_telco, id2_telco, telco_without_id, cat_telco, num_telco, categorical_telco, numerical_telco
2) PSI
To do the entity matching, we have to select a variable (or several variables) that uniquely identifies each instance. This matching will be done through the variables ID_1 and ID_2, which we add to our datasets.
There will be samples that are exclusive to the insurance company, and others exclusive to the telco company. The objective is to match the samples that are common to both clients. If the insurance company node and the telco company node do not hold the same instances, with PSI we work only with the intersection of the observations, that is, the instances present in both the insurance company and the telco company.
2.1) Execute PSI
In node_initialization, the nodes in nodes_list are endowed with tools (functions and data structures) to perform hashing, PSI, and Vertical Federated Learning (VFL).
Note that the first node (client node) and the last node (server node) are supposed to be mounted on the same physical node (the insurance company, which has the labels).
nodes_list[0].set_private_data(insurance)
nodes_list[1].set_private_data(telco)
nodes_list[2].set_private_data(data_server)
As we have presented, PSI is a technique aimed at determining the intersection of two private sets, without sharing the elements of such sets.
In our case, the goal is to determine the intersection of the sets of identifiers (e.g. e-mail, ID-card number or name) of samples in datasets owned by different parties, without sharing the identifier values. Indeed, an identifier may itself be private.
Once the intersection is determined, the intersecting identifiers are ordered, thus aligning the datasets belonging to the different organizations.
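As a toy sketch of the steps printed below (a real deployment would additionally encrypt the hashed identifiers before sending them; the full protocol belongs to the platform), hashing identifiers onto Z_p and intersecting on the server can be written as:

```python
import hashlib

def hash_to_zp(identifiers, p):
    """STEP 1 (sketch): deterministically hash raw identifiers onto Z_p."""
    return {int(hashlib.sha256(str(i).encode()).hexdigest(), 16) % p
            for i in identifiers}

def psi_sketch(ids_a, ids_b, p=1048343):
    """STEPS 3-4 (simplified): each party sends its hashed identifiers
    and the server intersects them without seeing the raw values."""
    return hash_to_zp(ids_a, p) & hash_to_zp(ids_b, p)

# Two toy ID lists sharing the elements 603 and 6078.
common = psi_sketch([6078, 22466, 603], [603, 20288, 6078])
```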
p = 1048343
feast_list = ["ID_1", "ID_2"]
intersection_federated_government.run_intersection(nodes_list, feast_list, p)
STEP 1. Hash identifiers onto Z_p.
STEP 2. First checks and parameters definition.
STEP 3. Send encrypted identifiers to the server.
STEP 4. Compute intersection hashed identifiers.
Since the instances are not aligned, PSI takes charge of aligning the matches in a privacy-preserving manner, so the datasets become synchronized.
n = 2
for k in range(n + 1):
    nodes_list[k].call('synchronize_dataset')
2.2) Compare PSI with a non-private Set Intersection
Now we will obtain the intersection manually, without caring about privacy. This will help us later to evaluate how accurately the PSI was performed in comparison with a non-private intersection:
intersected_centralized_data = insurance.merge(telco, how = 'inner', on = ['ID_1', 'ID_2'])
The set intersection (SI) without any kind of privacy gives the number of elements common to the insurance company and the telco company:
#np means no privacy
inters_np = intersected_centralized_data["ID_1"].tolist()
len(inters_np)
22708
This means that the percentage of data in each node that was not in common with the other party is:
round(100-(len(inters_np)*100 / insurance.shape[0]),3)
12.996
round(100-(len(inters_np)*100 / telco.shape[0]),3)
12.996
This means that not all of each party's data will be used in the training, just the intersected data.
To benchmark the performance of federated learning against a non-private centralized case, we join these already shuffled and intersected datasets. Obviously, this is done only for study purposes and would be forbidden in a real-world scenario.
labels_insurance_for_comparison = nodes_list[0].query()["TARGET"]
labels_insurance_for_comparison = label_encoder(labels_insurance_for_comparison)
data_insurance_for_comparison = nodes_list[0].query().drop(["ID_1", "ID_2", "TARGET"], axis=1)
data_telco_for_comparison = nodes_list[1].query().drop(["ID_1", "ID_2"], axis=1)
centralized_datasets = pd.concat([data_insurance_for_comparison.reset_index(drop=True), data_telco_for_comparison], axis=1)
del data_telco_for_comparison
2.3) Data preprocessing
Now that the data are correctly aligned, we should preprocess them. The preprocessing consists of:
- Dropping the ID variable, since it is only used for the matching.
- If necessary, splitting the data into inputs and labels.
This is done with drop_id_and_split_label, a function inside the nodes:
nodes_list[0].call('drop_id_and_split_label', label_name='TARGET', id_name=feast_list)
nodes_list[1].call('drop_id_and_split_label', label_name='TARGET', id_name=feast_list)
nodes_list[2].call('drop_id_and_split_label', label_name='TARGET', id_name= feast_list)
As the final step, we can internally split the data into train and test in each node. In PSI, the cardinality of the intersection is known to the clients. As they have synchronized their datasets, the orchestrator can send the order to take a portion for training and a portion for testing to each client. In this case, the training portion will be the first 80% of the rows:
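Under the assumption that split_train_test inside each node simply cuts the synchronized rows at the same index (a sketch, not the framework's actual implementation), the alignment-preserving split looks like:

```python
import numpy as np

def split_train_test_aligned(data: np.ndarray, train_proportion: float = 0.8):
    """Take the first `train_proportion` of rows as training data.

    Because both parties' datasets are synchronized row-by-row after PSI,
    cutting at the same index keeps the train/test partitions aligned
    across nodes without any further communication.
    """
    cut = int(len(data) * train_proportion)
    return data[:cut], data[cut:]

rows = np.arange(10).reshape(10, 1)
train, test = split_train_test_aligned(rows, 0.8)
# train holds the first 8 rows, test the remaining 2
```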
nodes_list[0].call('split_train_test', train_proportion=0.8)
nodes_list[1].call('split_train_test', train_proportion=0.8)
nodes_list[2].call('split_train_test', train_proportion=0.8, is_label=True)
3) Run the experiment
Now we are going to execute the federated, local and centralized experiments in order to compare the results and illustrate how the model's metrics behave in these scenarios. As in the notebook explaining the basic concepts of VFL, we emulate the process of creating the whole structure and train the models.
3.1) Federated
The key idea of VFL is to enhance a learning model by utilizing the distributed data with the different attributes of the different parties. Hence, VFL has vertically partitioned data where participants' data share the same sample space with a different attribute space.
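As a minimal NumPy sketch of one forward pass of such a split network (the dimensions and random weights are illustrative assumptions, not the notebook's actual models): each client maps its local features to a shared embedding dimension, and the server sums the embeddings (as FedSumAggregator does below) and applies its own head:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each client owns different features of the SAME aligned samples.
x_client0 = rng.normal(size=(4, 5))   # e.g. insurance features
x_client1 = rng.normal(size=(4, 7))   # e.g. telco features

# Client models: one dense layer mapping local features to a shared
# embedding dimension (client_out_dim in the notebook).
w0 = rng.normal(size=(5, 3))
w1 = rng.normal(size=(7, 3))
emb0 = np.maximum(x_client0 @ w0, 0)  # ReLU
emb1 = np.maximum(x_client1 @ w1, 0)

# The server aggregates the embeddings by summation and applies its
# own head to produce the final prediction.
aggregated = emb0 + emb1
w_server = rng.normal(size=(3, 1))
prediction = 1 / (1 + np.exp(-(aggregated @ w_server)))  # sigmoid head
```

In training, the server would backpropagate the loss gradient through its head and send each client the gradient with respect to its embedding, so no raw features ever leave a node.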
In the remainder, we are going to test our privacy-preserving Federated Learning technology. In the spirit of privacy preservation, we set up a stricter access policy.
def privacy_preserving_query(dataset):
    """Return only the number of columns of the dataset."""
    return dataset.data.shape[1]
n = 2
for i in range(n + 1):
    nodes_list[i].configure_data_access(privacy_preserving_query)
We need this to give the models their input data shape.
client_out_dim = 32
model0 = tf.keras.models.Sequential()
model0.add(tf.keras.layers.Dense(client_out_dim, activation='relu', kernel_initializer='he_normal', input_shape=(nodes_list[0].query(),)))
model1 = tf.keras.models.Sequential()
model1.add(tf.keras.layers.Dense(client_out_dim, activation='relu', kernel_initializer='he_normal', input_shape=(nodes_list[1].query(),)))
optimizer0 = tf.keras.optimizers.RMSprop(learning_rate=0.0001)
optimizer1 = tf.keras.optimizers.RMSprop(learning_rate=0.0001)
batch_size = 32
model_nodes = [VerticalNeuralNetClientModelTensorFlow(model=model0, loss=None, optimizer=optimizer0, batch_size=batch_size),
VerticalNeuralNetClientModelTensorFlow(model=model1, loss=None, optimizer=optimizer1, batch_size=batch_size)]
2022-05-10 18:03:54.444159: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-05-10 18:03:54.444194: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-05-10 18:03:54.444227: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (sh-015-ws): /proc/driver/nvidia/version does not exist
2022-05-10 18:03:54.444547: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
# Define the model of the server node
model_server = tf.keras.models.Sequential()
model_server.add(tf.keras.layers.Dense(64, activation='relu', kernel_initializer='he_normal', input_shape=(client_out_dim,)))
model_server.add(tf.keras.layers.Dense(32, activation='relu', kernel_initializer='he_normal'))
model_server.add(tf.keras.layers.Dense(1, activation='sigmoid'))
loss_server = tf.keras.losses.BinaryCrossentropy(from_logits=False)
optimizer_server = tf.keras.optimizers.RMSprop(learning_rate=0.0001)
model = VerticalNeuralNetServerModelTensorFlow(model_server, loss_server, optimizer_server,
metrics=[tf.keras.metrics.AUC(num_thresholds=60000)])
# Set the model and the aggregator in the server node
nodes_list[2].set_model(model)
nodes_list[2].set_aggregator(FedSumAggregator())
# Configure data access to nodes and server
nodes_federation.configure_model_access(meta_params_query)
nodes_list[2].configure_model_access(meta_params_query)
nodes_list[2].configure_data_access(train_set_evaluation)
# Convert to float
nodes_federation.apply_data_transformation(cast_to_float);
# Create federated government
federated_government = VerticalFederatedGovernment(model_nodes,
nodes_federation,
server_node=nodes_list[2])
# Run training and testing
federated_government.run_rounds(n_rounds=100001,
eval_freq=10000)
Evaluation in round 0 :
loss: 0.5725527405738831 auc: 0.6032233238220215
Evaluation in round 10000 :
loss: 1.6924785375595093 auc: 0.6986650228500366
Evaluation in round 20000 :
loss: 2.4987637996673584 auc: 0.7234680652618408
Evaluation in round 30000 :
loss: 2.7209153175354004 auc: 0.7400737404823303
Evaluation in round 40000 :
loss: 2.5503551959991455 auc: 0.7559691667556763
Evaluation in round 50000 :
loss: 2.406587839126587 auc: 0.7721993327140808
Evaluation in round 60000 :
loss: 2.7810895442962646 auc: 0.7838234901428223
Evaluation in round 70000 :
loss: 3.3195133209228516 auc: 0.7927149534225464
Evaluation in round 80000 :
loss: 3.5789554119110107 auc: 0.7998210191726685
Evaluation in round 90000 :
loss: 3.9660234451293945 auc: 0.8050363659858704
Evaluation in round 100000 :
loss: 4.442634105682373 auc: 0.8091144561767578
predictions_fed = nodes_list[2].call('s_plot_roc')
3.2) Local (Data from the insurance company only)
The local case refers to the situation where we only have the data belonging to one of the clients, in this case the insurance company. This will be useful to understand how the telco data improves the metric of the model obtained using only the insurance company's data. Note that the data should always remain inside the node. As this is an experimental notebook intended to illustrate the federated experiment and make a comparison, we will use the test data of the local party (the insurance company), but keep in mind that this operation is done locally by one party.
local_data = data_insurance_for_comparison.to_numpy()
del data_insurance_for_comparison
train_data_insurance, train_labels, test_data_insurance, test_labels = split_train_test(local_data,
labels_insurance_for_comparison)
model_loc = tf.keras.models.Sequential()
model_loc.add(tf.keras.layers.Dense(32, activation='relu', kernel_initializer='he_normal', input_shape=(train_data_insurance.shape[1],)))
model_loc.add(tf.keras.layers.Dense(64, activation='relu', kernel_initializer='he_normal'))
model_loc.add(tf.keras.layers.Dense(32, activation='relu', kernel_initializer='he_normal'))
model_loc.add(tf.keras.layers.Dense(1, activation='sigmoid'))
opt = keras.optimizers.RMSprop(learning_rate=0.0001)
model_loc.compile(optimizer=opt, loss='binary_crossentropy', metrics=[tf.keras.metrics.AUC(from_logits=True)])
model_loc.fit(train_data_insurance, train_labels, epochs=200, verbose=0)
<keras.callbacks.History at 0x7f70c05c72e0>
predictions_loc = model_loc.predict(test_data_insurance)
acc_loc=accuracy(predictions_loc, test_labels)
f1_loc=f1(predictions_loc, test_labels)
plot_roc(predictions_loc, test_labels, "./local.png")
3.3) Centralized (Data joined without any privacy)
The centralized data represents a node that has the whole dataset, joined without any kind of privacy. In principle, this implies better accuracy but, for sure, it cannot happen in a real-world scenario, where the data are dispersed over different organizations under the protection of privacy restrictions. We load the centralized data by joining the two datasets that have been matched with a non-private SI.
centralized_datasets.head()
(Output: the first five rows of the centralized dataset, combining the one-hot-encoded features of both parties (COMUNIDAD_*, SEXO_2, EDUCATION_*, TOMA_MEDIC_*, IMC_*, ESTANCIA_HOSPITAL_*, VIDA_*, AUTO_*, HOGAR_*, TV_*, N_PRODUCTOS_*, NIVEL_PASOS_*, SANA_*, INFANTIL_*) with standardized numeric columns such as EDAD, FILTRADO, FACTURA, CAMBIO_DIRECCION, REDUCCION_LINEAS, SEGURO_MOVIL, BROWSE_CATGRP.* and the PERMISOS_* features.)
centralized_data = centralized_datasets.to_numpy()
# The rows are matched with the local case, so the label split from above
# (train_labels, test_labels) is reused; here only the features are split.
train_centralized_data, test_centralized_data = split_train_test(centralized_data)
# Centralized baseline: the same MLP architecture, trained on the joined data.
model_cent = tf.keras.models.Sequential()
model_cent.add(tf.keras.layers.Dense(32, activation='relu', kernel_initializer='he_normal', input_shape=(train_centralized_data.shape[1],)))
model_cent.add(tf.keras.layers.Dense(64, activation='relu', kernel_initializer='he_normal'))
model_cent.add(tf.keras.layers.Dense(32, activation='relu', kernel_initializer='he_normal'))
model_cent.add(tf.keras.layers.Dense(1, activation='sigmoid'))
opt = tf.keras.optimizers.RMSprop(learning_rate=0.0001)
# As above, the sigmoid output means the AUC metric must not expect logits.
model_cent.compile(optimizer=opt, loss='binary_crossentropy', metrics=[tf.keras.metrics.AUC()])
model_cent.fit(train_centralized_data, train_labels, epochs=200, verbose=0)
predictions_cent = model_cent.predict(test_centralized_data)
acc_cent=accuracy(predictions_cent, test_labels)
f1_cent=f1(predictions_cent, test_labels)
plot_roc(predictions_cent, test_labels, "./centralized.png")
3.4) Comparison
3.4.1) ROC curve
With the next function, we will plot a comparison of the ROC curves of the three cases that we have presented:
values=[predictions_loc, predictions_fed, predictions_cent]
titles=['Local', 'Federated', 'Centralized']
colors=['blue', 'green', 'red']
linestyle=[':','-','-.']
s_plot_all_roc_curves(test_labels, values, titles, colors, linestyle)
3.4.2) F1-Score
To understand how the telco company's data improves the predictions, we are going to look at the F1-Score. We use this metric because the dataset is heavily imbalanced, and accuracy is not a good metric in that case. Let's calculate the F1-Score of the federated learning model and compare the three cases:
f1_fed=f1(predictions_fed, test_labels)
values=[round(f1_loc, 3), round(f1_fed, 3), round(f1_cent, 3)]
titles=['Local', 'Federated', 'Centralized']
colors=['blue', 'green', 'red']
s_plot_all_metric(values, "F1-Score", titles, colors)
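The `f1` helper used above is defined earlier in the notebook. A minimal sketch of what such a helper is assumed to compute, thresholding the sigmoid outputs at an illustrative 0.5:

```python
import numpy as np

def f1(predictions, labels, threshold=0.5):
    """Hypothetical sketch: binary F1-Score from sigmoid outputs."""
    pred = (np.ravel(predictions) >= threshold).astype(int)
    y = np.ravel(labels).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    if tp == 0:
        # No true positives: precision and recall vanish, so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```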
These figures show three different results:
- Local: the result of using just the local data of the insurance company. This corresponds to the local case, where we have the features of the insurance company but not those of the telco company.
- Federated: the result of the federated experiment, where we have used the features of both clients in a privacy-preserving manner using Private Set Intersection.
- Centralized: the result of using the centralized aggregation of both clients' data without any kind of privacy.
The figures show the benefit of using the data of both clients in a privacy-preserving manner compared with using just one client. Even though the best scenario in terms of the metric is the centralized one, its improvement with respect to the federated case is not really significant.
In summary, with the models used in the nodes, the results of the ROC AUC for the three different cases are:
Local (0.73) << Federated (0.85) < Centralized (0.91)
We improve the insurance company's prediction by 0.12 by using the telco company's data in a federated way while preserving privacy. By using all the data but without preserving privacy, we would only improve by 0.06 with respect to the federated model.
And the results of the F1 score are:
Local (0.451) << Federated (0.746) < Centralized (0.779)
We improve the insurance company's prediction by 0.295 by using the telco company's data in the federated way. By using all the data but without preserving privacy, we would only improve by 0.033 with respect to the federated model.
The model trained solely on the insurance company's data has no data enrichment, whereas the federated one enjoys data enrichment and data privacy and complies with the regulatory requirements, apart from achieving a substantial accuracy improvement.
As a conclusion of this notebook, we can see the benefits of using Sherpa.ai's Privacy-Preserving platform in a Vertical Federated scenario where the data of the different parties are not aligned. The prediction is almost as accurate as with traditional Machine Learning methods, but with the significant benefit of ensuring data privacy and regulatory compliance.
Furthermore, PSI has been proven to work properly, since most of the data has been correctly aligned.
Appendix: Using accuracy to compare the experiments
Here we are going to show the accuracy of the models. Before doing so, we have to calculate the accuracy of the federated model:
acc_fed=accuracy(predictions_fed, test_labels)
values=[round(acc_loc, 3), round(acc_fed, 3), round(acc_cent, 3)]
titles=['Local', 'Federated', 'Centralized']
colors=['blue', 'green', 'red']
s_plot_all_metric(values, "Accuracy", titles, colors)
If one trusted the accuracy metric, one would reach erroneous conclusions, because 0.8 accuracy intuitively seems fine. But a model that classified every element as 0 would get, on the local test set, an accuracy of:
round(sum(test_labels==0)/(sum(test_labels==1)+sum(test_labels==0)),5)
0.77998
This is only 0.023 less than the accuracy of the local model. We therefore conclude that, due to the imbalance of the dataset, accuracy is not a good metric, which is why we use the F1-Score for a fair comparison.
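The same point can be reproduced with a tiny synthetic example (the 78/22 class balance below is illustrative, chosen to mimic the notebook's):

```python
import numpy as np

# Illustrative labels with roughly the notebook's class balance (78% zeros).
labels = np.array([0] * 78 + [1] * 22)
trivial = np.zeros_like(labels)  # a "model" that predicts 0 for everything

accuracy_trivial = np.mean(trivial == labels)  # 0.78, deceptively good

# The trivial model finds no positives at all, so its recall is 0 and
# therefore its F1-Score is 0, exposing it as useless.
true_positives = np.sum((trivial == 1) & (labels == 1))
recall = true_positives / np.sum(labels == 1)
```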