TabVFL: Improving Latent Representation in Vertical Federated Learning

Autoencoders are popular neural networks that compress high-dimensional data to extract relevant latent information. TabNet is a state-of-the-art neural network model designed for tabular data that utilizes an autoencoder architecture for training. Vertical Federated Learning (VFL) is an emerging distributed machine learning paradigm that allows multiple parties to train a model collaboratively on vertically partitioned data while maintaining data privacy. The existing design for training autoencoders in VFL is to train a separate autoencoder at each participant and aggregate the latent representations afterwards. This design could break important correlations between the feature data of participating parties, because each autoencoder is trained only on locally available features and disregards the features held by others. In addition, traditional autoencoders are not specifically designed for tabular data, which is ubiquitous in VFL settings. Moreover, the impact of client failures during training on model robustness is under-researched in VFL. In this paper, we propose TabVFL, a distributed framework designed to improve latent representation learning using the joint features of participants. The framework (i) preserves privacy by mitigating potential data leakage with the addition of a fully-connected layer, (ii) conserves feature correlations by learning one latent representation vector, and (iii) provides enhanced robustness against client failures during the training phase. Extensive experiments on five classification datasets show that TabVFL can outperform the prior work design, with a 26.12% improvement in F1-score.
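
The prior design described in the abstract (one encoder per participant, latents aggregated afterwards) can be illustrated with a minimal sketch. This is not the TabVFL method or any code from the paper: the function names, the data shape, the number of parties, and the use of a PCA projection as a stand-in for a locally trained autoencoder are all assumptions made for illustration.

```python
import numpy as np

def vertical_partition(X, n_parties):
    """Split the feature columns of X into n_parties blocks (hypothetical helper)."""
    return np.array_split(X, n_parties, axis=1)

def local_linear_encoder(X_local, latent_dim):
    """Assumed stand-in for a locally trained autoencoder: a PCA projection.

    Each party sees only its own columns, so the encoding cannot use
    correlations with features held by other parties.
    """
    centered = X_local - X_local.mean(axis=0)
    # Rows of vt are the principal directions of the local features.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:latent_dim].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))       # synthetic data: 100 samples, 6 features
parts = vertical_partition(X, 3)    # 3 parties, 2 features each

# Prior design: encode locally at each party, then aggregate the latents.
latents = [local_linear_encoder(p, latent_dim=1) for p in parts]
aggregated = np.concatenate(latents, axis=1)
```

Here every party shares the same sample rows but holds a different slice of the columns, which is what "vertically partitioned" means; the aggregated latent is simply the per-party encodings placed side by side.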

