Vertical Federated Learning with the platform using PSI

Federated Learning is a Machine Learning paradigm aimed at learning models from decentralized data, such as data located on users’ smartphones, in hospitals, or in banks, while ensuring data privacy. This is achieved by training the model locally at each node (e.g., on each smartphone, at each hospital, or at each bank), sharing the updated local model parameters (not the data), and securely aggregating them to build a better global model.

Traditional Machine Learning requires all the data to be gathered in one single place. In practice, this is often forbidden by privacy regulations. For this reason, Federated Learning is introduced, the goal being to learn from a large amount of data, while preserving privacy.

The supported Federated Learning categories are Horizontal, Vertical and Transfer. In this notebook, we train with Vertical Federated Learning (VFL), where the nodes share overlapping samples (the same sample ID space) but hold different features. VFL exploits this heterogeneity to train a more accurate model. The main idea is to split a Neural Network among the different parties and a server.

The vertical federated learning paradigm defined by

VFL requires the nodes to share overlapping samples, i.e. the two clients must have the same samples in the same order (the first row of the first client must match the first row of the second client, and so on). In practice, this assumption usually does not hold. To bring the data into this form, we perform Private Set Intersection (PSI) or Privacy Preserving Entity Resolution (PPER).

Private Set Intersection concept as done with the platform

In this procedure, encrypted identifiers (e.g. email, card ID...) are shared between the parties to enable each company to link the customers they have in common before the training of the local models. As only encrypted identifiers are shared, both customers’ identifiers and the other private features (e.g. age, salary, postal code) of each company are kept safe. The identifier can be a composition of two or more variables with a certain transformation. This can lead to more security as the raw identifier is not shared.
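The idea of composing and transforming an identifier before sharing it can be sketched as follows. This is only an illustration: the field names, the shared key, and the choice of HMAC-SHA-256 are assumptions made for the example, not the procedure used by the platform.

```python
import hashlib
import hmac


def composite_identifier(email: str, card_id: str) -> str:
    """Join two normalized fields into a single identifier string."""
    return f"{email.strip().lower()}|{card_id.strip().upper()}"


def protect_identifier(identifier: str, shared_key: bytes) -> str:
    """Return a keyed SHA-256 digest so the raw identifier is never sent."""
    return hmac.new(shared_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key that both companies would have agreed on beforehand.
shared_key = b"example-key-agreed-by-both-parties"

# The same customer, written slightly differently in each company's records.
digest_a = protect_identifier(composite_identifier(" Ana@Example.com", "x123"), shared_key)
digest_b = protect_identifier(composite_identifier("ana@example.com ", "X123"), shared_key)

print(digest_a == digest_b)  # the two digests match after normalization
```

Normalizing the fields before hashing matters because any difference in case or whitespace would otherwise produce unrelated digests and the customer would not be matched.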

Entity matching procedure as it is done in the platform

Nevertheless, Company A may be able to know which clients it has in common with Company B, and vice versa. Technically speaking, PSI makes the sample identifiers of the intersection visible to all parties, so each party learns that the entities in the intersection also appear in the other parties' data. Even so, neither company can obtain the other company's data, so the information regarding the clients stays safe.

On the one hand, there are cases where this membership information leakage is allowed between the companies. On the other hand, there are cases where this membership information is sensitive and must be protected because of privacy standards and regulations.

In our case, we assume that the insurance company and the telco company have agreed to learn the sample IDs of the intersection.

A telco and an insurance company want to share information while complying with privacy regulations. In this notebook we simulate a fictional scenario where an insurance company and a telco want to collaborate, using the platform in a private way, to train a model that predicts whether a person will take out a life insurance policy. The insurance company has poor data that does not allow it to make good predictions. With the knowledge of the telco, the insurance company will improve its predictions. So we are trying to predict both the probability of death and purchase, using private client data from the telco and the insurance company, without the companies sharing each other’s data. The general description of the problem is:

The architecture of the vertical solution for the telco company and an insurance company proposed by


Now that we have a general overview of our problem, the procedure is the following:


As we can see, the preprocessing in this case is done before we do the PSI.

0) Libraries and data

import warnings

import matplotlib
import tensorflow as tf
from shfl.auxiliar_functions_for_notebooks import intersection_federated_government
from shfl.auxiliar_functions_for_notebooks.functionsFL import *
from shfl.auxiliar_functions_for_notebooks.node_initialization import nodes_list, nodes_federation
from shfl.auxiliar_functions_for_notebooks.preprocessing import *
from shfl.data_base.data_base import split_train_test
from shfl.federated_aggregator import FedSumAggregator
from shfl.federated_government.vertical_federated_government import VerticalFederatedGovernment
from shfl.model.vertical_deep_learning_model_tf import VerticalNeuralNetClientModelTensorFlow
from shfl.model.vertical_deep_learning_model_tf import VerticalNeuralNetServerModelTensorFlow
from shfl.private.reproducibility import Reproducibility
from tensorflow import keras
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('seaborn')
pd.set_option("display.max_rows", 30, "display.max_columns", None)

1 nodes_list= <shfl.private.federated_operation.HeterogeneousDataNode object at 0x7f70e744fb50>
1 nodes_list= <shfl.private.federated_operation.HeterogeneousDataNode object at 0x7f70e73fe730>
1 nodes_list= <shfl.private.federated_operation.VerticalServerDataNode object at 0x7f70e744fd90>

<shfl.private.reproducibility.Reproducibility at 0x7f71e4414130>

We load the data of the insurance company and the telco company:

insurance = pd.read_csv("./data_from_insurance.csv", sep=",")
telco = pd.read_csv("./data_from_telco.csv", sep=",")
  • The data from the insurance company:
  • The data from the telco company:

1) Prepare the data before doing the vertical federated learning scenario with PSI

In this notebook, we suppose that both the insurance company and the telco company preprocess their data before using the Federated Learning framework. We proceed as follows:

print('There are {} observations with {} features from the insurance company.'.format(insurance.shape[0], insurance.shape[1]))
print('There are {} observations with {} features from the telco company.'.format(telco.shape[0], telco.shape[1]))
There are 26100 observations with 13 features from the insurance company.
There are 26100 observations with 22 features from the telco company.

1.1) Preprocessing of the insurance company

The insurance company selects its categorical and numerical variables:

numerical_insurance = ["EDAD"]

The variable to predict is TARGET:

target = insurance.TARGET
labels = target.to_numpy().reshape(-1,1)

We do not want to alter the ID since we need it to perform the PSI. Furthermore, the insurance company one-hot encodes its categorical variables and normalizes its numerical variables.

cat_insurance = one_hot_encode(insurance[categorical_insurance], categorical_insurance)
num_insurance = normalize_data(insurance[numerical_insurance])
insurance_without_id = pd.concat([cat_insurance, num_insurance], axis=1)
insurance = pd.concat([pd.DataFrame(id1_insurance), pd.DataFrame(id2_insurance)], axis=1)
insurance = pd.concat([insurance, pd.DataFrame(insurance_without_id)], axis=1)
insurance = pd.concat([insurance, target], axis=1)
del categorical_insurance, numerical_insurance, cat_insurance, num_insurance, insurance_without_id, id1_insurance, id2_insurance
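
To make the two transformations concrete, here is a dependency-free sketch of one-hot encoding and normalization. The helpers `one_hot` and `min_max` are hypothetical stand-ins: the actual `one_hot_encode` and `normalize_data` functions of the notebook are not shown here, and min-max scaling is only an assumed choice of normalization.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


def min_max(values):
    """Scale numbers linearly to [0, 1]; a constant column maps to 0.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


categories, encoded = one_hot(["b", "a", "b"])
print(categories)  # ['a', 'b']
print(encoded)     # [[0, 1], [1, 0], [0, 1]]
print(min_max([20, 40, 60]))  # [0.0, 0.5, 1.0]
```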

Finally the data from the insurance company is:


The server in this case corresponds to the insurance company, since it is the one that has the labels:

data_server = insurance[["ID_1", "ID_2", "TARGET"]]

1.2) Preprocessing of the telco company

The telco company does the same process:

categorical_telco = ["TV", "N_PRODUCTOS", "NIVEL_PASOS", "SANA", "INFANTIL"]
                    'PERMISOS_TERCEROS', 'PERMISOS_GUIA']
cat_telco = one_hot_encode(telco[categorical_telco], categorical_telco)
num_telco = normalize_data(telco[numerical_telco])
telco_without_id = pd.concat([cat_telco, num_telco], axis=1)
telco = pd.concat([pd.DataFrame(id1_telco), pd.DataFrame(id2_telco)], axis=1)
telco = pd.concat([telco, telco_without_id], axis=1)
del id1_telco, id2_telco, telco_without_id, cat_telco, num_telco, categorical_telco, numerical_telco

2) PSI

To do the entity matching, we have to select a variable (or a set of variables) that uniquely identifies each instance. Here the matching is done through the variables ID_1 and ID_2, which we add to our datasets. Some samples are exclusive to the insurance company, and others are exclusive to the telco company. The objective is to match the instances that are common to both clients. If the insurance company node and the telco company node do not hold the same instances, PSI lets us work only with the intersection of those observations, that is, the instances that are present in both the insurance company and the telco company.

2.1) Execute PSI

In node_initialization, the nodes in nodes_list are equipped with the tools (functions, data structures) needed to perform hashing, PSI and Vertical Federated Learning (VFL). Note that the first node (client node) and the last node (server node) are assumed to be hosted on the same physical node (the insurance company, which has the labels).


As we have presented, PSI is a technique aimed at determining the intersection of two private sets, without sharing the elements of such sets.

In our case, the goal is to determine the intersection of the identifier sets I_A and I_B (e.g. e-mail, ID-card number or name) of samples of datasets owned by different parties, without sharing the identifier values. Indeed, an identifier may be private.

Once the intersection is determined, the intersecting identifiers are ordered, thereby aligning the datasets that belong to different organizations.
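
The hash, intersect and order idea can be sketched in a few lines. Note the assumptions: this is a naive hashed intersection for illustration only, it is not a secure PSI protocol and not the platform's implementation (which also encrypts the hashed identifiers). The helper names and the SHA-256 choice are hypothetical; only the modulus p is taken from this notebook.

```python
import hashlib

P = 1048343  # same modulus p as used in this notebook


def hash_to_zp(identifier: str, p: int = P) -> int:
    """Map an identifier to an integer in Z_p (collisions are possible)."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % p


def align_on_intersection(ids_a, ids_b, p=P):
    """Return row indices into A and into B for the common identifiers,
    both listed in the same order (sorted by hashed value)."""
    hashed_a = {hash_to_zp(identifier, p): row for row, identifier in enumerate(ids_a)}
    hashed_b = {hash_to_zp(identifier, p): row for row, identifier in enumerate(ids_b)}
    common = sorted(hashed_a.keys() & hashed_b.keys())
    return [hashed_a[h] for h in common], [hashed_b[h] for h in common]


ids_insurance = ["u3", "u1", "u7", "u5"]
ids_telco = ["u5", "u9", "u3"]

rows_a, rows_b = align_on_intersection(ids_insurance, ids_telco)
aligned_a = [ids_insurance[row] for row in rows_a]
aligned_b = [ids_telco[row] for row in rows_b]

# After alignment, row k of one party refers to the same customer as row k of the other.
print(aligned_a == aligned_b)
```

Sorting by the hashed value gives both parties the same row order without either of them having to reveal its original ordering.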

The process of performing the Private Set Intersection (PSI) in the platform
p = 1048343
feast_list=["ID_1", "ID_2"]

intersection_federated_government.run_intersection(nodes_list, feast_list, p)
STEP 1. Hash identifiers onto Z_p.
STEP 2. First checks and parameters definition.
STEP 3. Send encrypted identifiers to the server.
STEP 4. Compute intersection hashed identifiers.

Since the instances are not aligned, PSI is in charge of aligning the matches in a privacy-preserving manner. As a result, the datasets will be synchronized.

for k in range(n + 1):

2.2) Compare PSI with a non-private Set Intersection

Now we obtain the intersection manually, without caring about privacy. This will help us later to evaluate how accurately the PSI was performed in comparison with a non-private intersection:

intersected_centralized_data = insurance.merge(telco, how = 'inner', on = ['ID_1', 'ID_2'])

The number of elements common to the insurance company and the telco company is obtained from the set intersection (SI) without any kind of privacy:

#np means no privacy
inters_np = intersected_centralized_data["ID_1"].tolist()

This means that the percentages of data in each node that are not in common with the other party are:

round(100-(len(inters_np)*100 / insurance.shape[0]),3)
round(100-(len(inters_np)*100 / telco.shape[0]),3)

This means that not all the data of each party will be used in the training. Just the data that is intersected.
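
As a small worked example of these two checks, the sketch below computes the percentage of a party's rows that fall outside the intersection and the fraction of the non-private intersection that a PSI run recovered. The function names and the toy identifier lists are hypothetical and do not come from the notebook's data.

```python
def percent_not_shared(n_common: int, n_total: int) -> float:
    """Percentage of a party's rows that are not in the intersection."""
    return round(100 - (n_common * 100 / n_total), 3)


def psi_agreement(psi_ids, plain_ids) -> float:
    """Fraction of the non-private intersection that PSI also found."""
    plain = set(plain_ids)
    if not plain:
        return 1.0
    return len(plain & set(psi_ids)) / len(plain)


plain_ids = ["u1", "u3", "u5", "u7"]  # toy non-private intersection
psi_ids = ["u1", "u3", "u5"]          # toy PSI result that missed one match

print(percent_not_shared(len(plain_ids), 10))  # 60.0
print(psi_agreement(psi_ids, plain_ids))       # 0.75
```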

To have a benchmark of the performance of federated learning against a non-private centralized case, we join these already shuffled and intersected datasets. Obviously, this is done only for study purposes and would be forbidden in a real-world scenario.

labels_insurance_for_comparison = nodes_list[0].query()["TARGET"]
labels_insurance_for_comparison = label_encoder(labels_insurance_for_comparison)
data_insurance_for_comparison = nodes_list[0].query().drop(["ID_1", "ID_2", "TARGET"], axis=1)
data_telco_for_comparison = nodes_list[1].query().drop(["ID_1", "ID_2"], axis=1)
centralized_datasets = pd.concat([data_insurance_for_comparison.reset_index(drop=True), data_telco_for_comparison], axis=1)
del data_telco_for_comparison

2.3) Data preprocessing

Now that the data are correctly aligned, we should preprocess them. The preprocessing consists of:

  1. Drop the ID variable since it is only used to do the matching
  2. If necessary, split the data into inputs and labels.

That will be done with drop_id_and_split_label, a function inside the nodes:

nodes_list[0].call('drop_id_and_split_label', label_name='TARGET', id_name=feast_list)
nodes_list[1].call('drop_id_and_split_label', label_name='TARGET', id_name=feast_list)
nodes_list[2].call('drop_id_and_split_label', label_name='TARGET', id_name=feast_list)

As the final step, we can internally split the data into train and test sets in each node. In PSI, the cardinality of the intersection is known to the clients. As they have synchronized their datasets, the orchestrator can send the order to take one portion for training and one portion for testing in each client. In this case, the training portion is the first 80% of the rows:
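
A minimal sketch of why a positional split keeps the parties synchronized: once the rows are aligned, every party applies the same cut point, so training row k of one party still corresponds to training row k of the others. The function `split_first_rows` is a hypothetical illustration, not the node's `split_train_test` method.

```python
def split_first_rows(rows, train_proportion=0.8):
    """Keep the first fraction of rows for training and the rest for testing."""
    cut = int(len(rows) * train_proportion)
    return rows[:cut], rows[cut:]


# Toy aligned data: row k of each list belongs to the same customer.
insurance_rows = [f"ins_{k}" for k in range(10)]
telco_rows = [f"tel_{k}" for k in range(10)]
labels = [k % 2 for k in range(10)]

train_ins, test_ins = split_first_rows(insurance_rows)
train_tel, test_tel = split_first_rows(telco_rows)
train_y, test_y = split_first_rows(labels)

print(len(train_ins), len(test_ins))  # 8 2
print(train_ins[-1], train_tel[-1])   # ins_7 tel_7
```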

nodes_list[0].call('split_train_test', train_proportion=0.8)
nodes_list[1].call('split_train_test', train_proportion=0.8)
nodes_list[2].call('split_train_test', train_proportion=0.8, is_label=True)

3) Run the experiment

Now we execute the federated, local and centralized experiments in order to compare the results and illustrate how the model's metrics behave in these scenarios. As in the notebook explaining the basic concepts of VFL, we emulate the process of creating the whole structure and train the models.

3.1) Federated

The key idea of VFL is to enhance a learning model by utilizing the distributed data with the different attributes of the different parties. Hence, VFL has vertically partitioned data where participants' data share the same sample space with a different attribute space.
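
The forward pass of such a split network can be sketched without any framework. Each party maps its own features to an embedding, the embeddings are combined, and the server computes the prediction from the combined embedding. The layer sizes, the random weights and the toy inputs are assumptions for illustration, and summing the embeddings is inferred from the FedSumAggregator imported in this notebook rather than confirmed platform behaviour.

```python
import math
import random


def dense(x, weights, bias):
    """Compute W x + b for one sample; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(vector):
    return [max(0.0, a) for a in vector]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def random_layer(n_out, n_in, rng):
    """Small random weights and zero biases for a toy dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
embedding_dim = 4
w_ins, b_ins = random_layer(embedding_dim, 3, rng)  # insurance client: 3 features
w_tel, b_tel = random_layer(embedding_dim, 5, rng)  # telco client: 5 features
w_srv, b_srv = random_layer(1, embedding_dim, rng)  # server head: 1 output

# One aligned sample: each party only ever sees its own features.
x_insurance = [0.2, 1.0, 0.0]
x_telco = [1.0, 0.0, 0.3, 0.7, 0.0]

emb_ins = relu(dense(x_insurance, w_ins, b_ins))  # computed at the insurance node
emb_tel = relu(dense(x_telco, w_tel, b_tel))      # computed at the telco node
aggregated = [a + b for a, b in zip(emb_ins, emb_tel)]  # sum of the embeddings
probability = sigmoid(dense(aggregated, w_srv, b_srv)[0])  # computed at the server

print(0.0 < probability < 1.0)
```

Only the embeddings leave the clients, never the raw features. During training, the server would send back the gradient of the loss with respect to each embedding so that every client can update its own layer.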

Workflow for the vertical federated learning structure with the platform

In the remainder, we test our privacy-preserving Federated Learning technology. In the spirit of privacy preservation, we set up a stricter access policy.

def privacy_preserving_query(dataset):
    """Returns only the number of columns of dataset. """
    return dataset.shape[1]

n = 2
for i in range(n + 1):

We need to do this to give the models their input data shape.
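
The effect of such an access policy can be illustrated with a toy node. `ToyNode` and its methods are hypothetical and do not reproduce the platform's node API. The sketch only shows the principle that a query returns whatever the configured policy allows (here the number of columns) and nothing else.

```python
class ToyNode:
    """Holds private rows and answers queries only through an access policy."""

    def __init__(self, rows):
        self._rows = rows    # private data, never returned directly
        self._policy = None  # no access until a policy is configured

    def set_access_policy(self, policy):
        self._policy = policy

    def query(self):
        if self._policy is None:
            raise PermissionError("no access policy configured")
        return self._policy(self._rows)


def number_of_columns(rows):
    """Access policy that reveals only how many features each row has."""
    return len(rows[0]) if rows else 0


node = ToyNode([[0.1, 0.5, 1.0], [0.3, 0.2, 0.0]])
node.set_access_policy(number_of_columns)
print(node.query())  # 3
```

With this policy in place, the value returned by the query can be used directly as the input dimension of the client's first layer.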

client_out_dim = 32

model0 = tf.keras.models.Sequential()
model0.add(tf.keras.layers.Dense(client_out_dim, activation='relu', kernel_initializer='he_normal', input_shape=(nodes_list[0].query(),)))

model1 = tf.keras.models.Sequential()
model1.add(tf.keras.layers.Dense(client_out_dim, activation='relu', kernel_initializer='he_normal', input_shape=(nodes_list[1].query(),)))

optimizer0 = tf.keras.optimizers.RMSprop(learning_rate=0.0001)
optimizer1 = tf.keras.optimizers.RMSprop(learning_rate=0.0001)

batch_size = 32
model_nodes = [VerticalNeuralNetClientModelTensorFlow(model=model0, loss=None, optimizer=optimizer0, batch_size=batch_size),
               VerticalNeuralNetClientModelTensorFlow(model=model1, loss=None, optimizer=optimizer1, batch_size=batch_size)]
# Define the model of the server node
model_server = tf.keras.models.Sequential()
model_server.add(tf.keras.layers.Dense(64, activation='relu', kernel_initializer='he_normal', input_shape=(client_out_dim,)))
model_server.add(tf.keras.layers.Dense(32, activation='relu', kernel_initializer='he_normal'))
model_server.add(tf.keras.layers.Dense(1, activation='sigmoid'))

loss_server = tf.keras.losses.BinaryCrossentropy(from_logits=False)
optimizer_server = tf.keras.optimizers.RMSprop(learning_rate=0.0001)

model = VerticalNeuralNetServerModelTensorFlow(model_server, loss_server, optimizer_server,

# Set the model and the aggregator in the server node

# Configure data access to nodes and server
# Convert to float        
# Create federated government
federated_government = VerticalFederatedGovernment(model_nodes, 
# Run training and testing
Evaluation in  round  0 :
loss: 0.5725527405738831   auc: 0.6032233238220215

Evaluation in  round  10000 :
loss: 1.6924785375595093   auc: 0.6986650228500366

Evaluation in  round  20000 :
loss: 2.4987637996673584   auc: 0.7234680652618408

Evaluation in  round  30000 :
loss: 2.7209153175354004   auc: 0.7400737404823303

Evaluation in  round  40000 :
loss: 2.5503551959991455   auc: 0.7559691667556763

Evaluation in  round  50000 :
loss: 2.406587839126587   auc: 0.7721993327140808

Evaluation in  round  60000 :
loss: 2.7810895442962646   auc: 0.7838234901428223

Evaluation in  round  70000 :
loss: 3.3195133209228516   auc: 0.7927149534225464

Evaluation in  round  80000 :
loss: 3.5789554119110107   auc: 0.7998210191726685

Evaluation in  round  90000 :
loss: 3.9660234451293945   auc: 0.8050363659858704

Evaluation in  round  100000 :
loss: 4.442634105682373   auc: 0.8091144561767578
predictions_fed = nodes_list[2].call('s_plot_roc')
ROC AUC results using the platform for the federated training with the telco company and the insurance company

3.2) Local (Data from the insurance company only)

The local case refers to the situation where we only have the data belonging to one of the clients, which in this case is the insurance company. This is useful to understand how the telco data improves the metric of the model obtained by using only the data of the insurance company. Note that the data should always remain inside the node. As this is an experimental notebook that illustrates the federated experiment and draws a comparison, we use the test data of the local party (the insurance company), but keep in mind that this operation is done locally by one party.

local_data = data_insurance_for_comparison.to_numpy()
del data_insurance_for_comparison
train_data_insurance, train_labels, test_data_insurance, test_labels = split_train_test(local_data, 
model_loc = tf.keras.models.Sequential()
model_loc.add(tf.keras.layers.Dense(32, activation='relu', kernel_initializer='he_normal', input_shape=(train_data_insurance.shape[1],)))
model_loc.add(tf.keras.layers.Dense(64, activation='relu', kernel_initializer='he_normal'))
model_loc.add(tf.keras.layers.Dense(32, activation='relu', kernel_initializer='he_normal'))
model_loc.add(tf.keras.layers.Dense(1, activation='sigmoid'))

opt = keras.optimizers.RMSprop(learning_rate=0.0001)
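# The last layer is a sigmoid, so the AUC metric receives probabilities, not logits
model_loc.compile(optimizer=opt, loss='binary_crossentropy', metrics=[tf.keras.metrics.AUC(from_logits=False)])
model_loc.fit(train_data_insurance, train_labels, epochs=200, verbose=0)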
<keras.callbacks.History at 0x7f70c05c72e0>
predictions_loc = model_loc.predict(test_data_insurance)
acc_loc=accuracy(predictions_loc, test_labels)
f1_loc=f1(predictions_loc, test_labels)
plot_roc(predictions_loc, test_labels, "./local.png")
ROC AUC results using the platform for the local training with the insurance company's dataset

3.3) Centralized (Data joined without any privacy)

The centralized data represents a node that has the whole dataset, joined without any kind of privacy. In principle, this implies better accuracy, but it cannot happen in a real-world scenario, where the data are dispersed over different organizations under the protection of privacy restrictions. We load the centralized data by joining the two datasets that have been matched with a non-private SI.

centralized_data = centralized_datasets.to_numpy()
train_centralized_data, test_centralized_data = split_train_test(centralized_data)
model_cent = tf.keras.models.Sequential()
model_cent.add(tf.keras.layers.Dense(32, activation='relu', kernel_initializer='he_normal', input_shape=(train_centralized_data.shape[1],)))
model_cent.add(tf.keras.layers.Dense(64, activation='relu', kernel_initializer='he_normal'))
model_cent.add(tf.keras.layers.Dense(32, activation='relu', kernel_initializer='he_normal'))
model_cent.add(tf.keras.layers.Dense(1, activation='sigmoid'))

opt = keras.optimizers.RMSprop(learning_rate=0.0001)
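# The last layer is a sigmoid, so the AUC metric receives probabilities, not logits
model_cent.compile(optimizer=opt, loss='binary_crossentropy', metrics=[tf.keras.metrics.AUC(from_logits=False)])
model_cent.fit(train_centralized_data, train_labels, epochs=200, verbose=0)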
<keras.callbacks.History at 0x7f70c06957f0>
predictions_cent = model_cent.predict(test_centralized_data)
acc_cent=accuracy(predictions_cent, test_labels)
f1_cent=f1(predictions_cent, test_labels)
plot_roc(predictions_cent, test_labels, "./centralized.png")
ROC AUC results using the platform for the centralized training of the telco company and the insurance company

3.4) Comparison

3.4.1) ROC curve

With the next function, we will plot a comparison between the metric of the three cases that we have presented:

values=[predictions_loc, predictions_fed, predictions_cent]
titles=['Local', 'Federated', 'Centralized']
colors=['blue', 'green', 'red']

s_plot_all_roc_curves(test_labels, values, titles, colors, linestyle)
ROC AUC results using the platform in comparison with local and centralized scenarios

3.4.2) F1-Score

To understand how the data of the telco company improves the predictions, we look at the F1-Score metric. We use this metric because the data is highly unbalanced, and accuracy is not a good metric when this happens. Let's calculate the F1-Score of the federated learning model and compare it with the other two:

f1_fed=f1(predictions_fed, test_labels)
values=[round(f1_loc, 3), round(f1_fed, 3), round(f1_cent, 3)]
titles=['Local', 'Federated', 'Centralized']
colors=['blue', 'green', 'red']
s_plot_all_metric(values, "F1-Score", titles, colors)
F1-Score results using the platform in comparison with local and centralized scenarios

These figures show three different results:

  • In blue, we have the result of using just the local data of the insurance company. This corresponds to the local case, where we have the features of the insurance company but not those of the telco company.
  • In green, we have the result of the federated experiment, where we have used the features of both clients in a privacy-preserving manner using Private Set Intersection.
  • In red, we have the result of using the centralized data aggregation of both clients without any kind of privacy.

The figures show the benefit of using the data of both clients in a privacy-preserving manner in comparison with just one client. Even though the best scenario in terms of the metric is the one where we use the data in a centralized way, its improvement over the federated case is small, since the two results are very similar.

In summary, by using linear models in the nodes, the results of the ROC AUC for the three different cases are:

Local (0.73) << Federated (0.85) < Centralized (0.91)

We improve the prediction of the insurance company by 0.12 by using the data of the telco company in a federated way while preserving privacy. By using all the data but without preserving privacy, we would only improve by a further 0.06 with respect to the federated model.

And the results of the F1 score are:

Local (0.451) << Federated (0.746) < Centralized (0.779)

We improve the prediction of the insurance company by 0.295 by using the data of the telco company in the federated way. By using all the data but without preserving privacy, we would only improve by a further 0.033 with respect to the federated model.

Improvement of F1-Score using the federated framework

The model trained solely with the insurance company's data has no data enrichment, whereas the federated one benefits from data enrichment and data privacy and complies with the applicable regulations, in addition to being more accurate.

As a conclusion of this notebook, we can see the benefits of using the Privacy-Preserving platform in a Vertical Federated scenario where the data of different parties are not aligned. The prediction is almost as accurate as with traditional machine learning methods, but with the significant benefit of ensuring data privacy and regulatory compliance.

Furthermore, PSI has been shown to work properly, since most of the data has been correctly aligned.

Appendix: Using accuracy to compare the experiments

Here we are going to show the accuracy of the models. Before doing so, we have to calculate the accuracy of the federated model:

acc_fed=accuracy(predictions_fed, test_labels)
values=[round(acc_loc, 3), round(acc_fed, 3), round(acc_cent, 3)]
titles=['Local', 'Federated', 'Centralized']
colors=['blue', 'green', 'red']
s_plot_all_metric(values, "Accuracy", titles, colors)
Accuracy results using the platform in comparison with local and centralized scenarios for the telco company and the insurance company case

If one trusts the accuracy metric, one may reach erroneous conclusions, because an accuracy of 0.8 intuitively seems fine. But if one classified all the elements as 0 in the local model, one would get an accuracy of:


This is only 0.023 less than the accuracy of the local model. We therefore conclude that, because the dataset is so unbalanced, accuracy is not a good metric, which is why we use the F1-Score to make a fair comparison.
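
The effect can be reproduced on a toy example. The labels below are hypothetical (20% positives) and are not the notebook's data, and the two metric functions are simple reference implementations written for this sketch, not the notebook's `accuracy` and `f1` helpers.

```python
def accuracy_score(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1] * 2 + [0] * 8  # toy unbalanced labels: 2 positives, 8 negatives
all_zeros = [0] * 10        # a "model" that always predicts the majority class

print(accuracy_score(y_true, all_zeros))  # 0.8
print(f1_score(y_true, all_zeros))        # 0.0
```

A classifier that never predicts the positive class looks acceptable under accuracy but scores zero under F1, which is the behaviour that motivates the choice of metric in this notebook.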