Autoencoders are popular neural networks that compress high-dimensional data to extract relevant latent information. TabNet is a state-of-the-art neural network model designed for tabular data that uses an autoencoder architecture for training. Vertical Federated Learning (VFL) is an emerging distributed machine learning paradigm that allows multiple parties to train a model collaboratively on vertically partitioned data while maintaining data privacy. The existing approach to training autoencoders in VFL trains a separate autoencoder at each participant and aggregates the latent representations afterwards. This design can break important correlations between the features held by different parties, because each autoencoder is trained only on locally available features and disregards the features of the others. In addition, traditional autoencoders are not specifically designed for tabular data, which is ubiquitous in VFL settings. Moreover, the impact of client failures during training on model robustness is under-researched in the VFL setting. In this paper, we propose TabVFL, a distributed framework designed to improve latent representation learning using the joint features of participants. The framework (i) preserves privacy by mitigating potential data leakage through the addition of a fully-connected layer, (ii) conserves feature correlations by learning a single latent representation vector, and (iii) provides enhanced robustness against client failures during the training phase. Extensive experiments on five classification datasets show that TabVFL can outperform the prior design, with an F1-score improvement of 26.12%.
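The contrast between the two designs described above can be illustrated with a small forward-pass sketch. This is not the TabVFL implementation: the layer sizes, the `tanh` activation, the untrained random weights, the placement of the fully-connected layer on the client side, and the helper names `dense`, `local_then_aggregate`, and `joint_latent` are all assumptions made purely for illustration. It only shows that, when each party encodes its own columns, a party's latent block cannot react to another party's features, whereas a single latent vector computed from all parties' outputs can.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(in_dim, out_dim):
    """Random weights for one fully-connected layer (untrained, illustration only)."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))

def local_then_aggregate(parts, encoders):
    """Prior design as described in the text: one encoder per party,
    latent blocks concatenated afterwards."""
    return np.concatenate(
        [np.tanh(p @ W) for p, W in zip(parts, encoders)], axis=1
    )

def joint_latent(parts, client_layers, server_encoder):
    """Hypothetical joint design: each party sends the output of a local
    fully-connected layer instead of raw features; a shared encoder then
    maps the concatenation to a single latent vector per sample."""
    sent = np.concatenate(
        [p @ W for p, W in zip(parts, client_layers)], axis=1
    )
    return np.tanh(sent @ server_encoder)

# Toy vertically partitioned data: same 8 samples, disjoint feature columns.
X = rng.normal(size=(8, 6))
parts = [X[:, :4], X[:, 4:]]

encoders = [dense(4, 2), dense(2, 2)]        # per-party encoders (prior design)
client_layers = [dense(4, 3), dense(2, 3)]   # per-party fully-connected layers
server_encoder = dense(6, 4)                 # shared encoder over both parties

z_local = local_then_aggregate(parts, encoders)
z_joint = joint_latent(parts, client_layers, server_encoder)

# Change only the second party's features and recompute both latents.
parts_changed = [parts[0], parts[1] + 1.0]
z_local_changed = local_then_aggregate(parts_changed, encoders)
z_joint_changed = joint_latent(parts_changed, client_layers, server_encoder)
```

In this sketch the first party's block of `z_local` is unchanged after the second party's features are altered, while `z_joint` changes, which is the sense in which a single latent vector can capture cross-party feature interactions.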