Federated learning is a paradigm that enables collaborative learning across parties without revealing raw data. Notably, vertical federated learning (VFL), where parties share the same set of samples but each holds only a subset of the features, has a wide range of real-world applications. However, most existing VFL studies disregard the "record linkage" process. Their algorithms either assume that data from different parties can be linked exactly or simply link each record to its most similar neighboring record. These approaches may fail to capture key features from other, less similar records. Moreover, such improper linkage cannot be corrected by training, since existing approaches provide no feedback on linkage during training. In this paper, we design a novel coupled training paradigm, FedSim, that integrates one-to-many linkage into the training process. Besides enabling VFL in many real-world applications with fuzzy identifiers, FedSim also achieves better performance in traditional VFL tasks. Moreover, we theoretically analyze the additional privacy risk incurred by sharing similarities. Our experiments on eight datasets with various similarity metrics show that FedSim outperforms state-of-the-art baselines. The code for FedSim is available at https://github.com/Xtra-Computing/FedSim.
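The difference between linking each record to its single most similar neighbor and one-to-many linkage can be illustrated with a minimal sketch. This is not FedSim's implementation: the function name `top_k_links`, the toy identifiers, and the similarity `1 / (1 + distance)` are all assumptions chosen for illustration, and the sketch ignores the privacy mechanisms a real VFL system would need.

```python
import math


def top_k_links(ids_a, ids_b, k):
    """For each identifier in party A, return the k most similar records
    in party B as (index_in_b, similarity) pairs, most similar first.

    Hypothetical helper for illustration; not part of the FedSim codebase.
    """
    links = []
    for a in ids_a:
        scored = []
        for j, b in enumerate(ids_b):
            dist = math.dist(a, b)  # Euclidean distance between identifiers
            # Assumed similarity in (0, 1]; any monotone decreasing map works.
            scored.append((j, 1.0 / (1.0 + dist)))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        links.append(scored[:k])
    return links


# Toy fuzzy identifiers (for example, noisy coordinates); values are made up.
party_a = [(0.0, 0.0), (5.0, 5.0)]
party_b = [(0.1, 0.0), (0.0, 1.0), (5.0, 4.0), (9.0, 9.0)]

# k=1 mimics "link to the most similar neighbor" and discards the rest.
one_to_one = top_k_links(party_a, party_b, k=1)

# k=2 keeps additional, less similar candidates together with their scores.
one_to_many = top_k_links(party_a, party_b, k=2)
```

With `k=1`, each record in `party_a` keeps only its nearest candidate. With `k=2`, the less similar candidates and their similarity scores are retained, so a downstream model could in principle learn how much weight each linked record deserves instead of committing to a fixed linkage before training.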