Vertical Federated Learning (VFL) is a key paradigm for training machine learning models on feature-partitioned, distributed data. However, because of privacy restrictions, few public real-world VFL datasets exist for algorithm evaluation, and those that do cover only a narrow range of feature distributions. Existing benchmarks therefore often resort to synthetic datasets created by arbitrarily splitting the features of a global feature set; such splits capture only a subset of the possible feature distributions, so algorithm performance is inadequately assessed. This paper addresses these shortcomings by introducing two key factors that affect VFL performance, feature importance and feature correlation, and by proposing associated evaluation metrics and dataset splitting methods. Additionally, we introduce a real VFL dataset to address the shortage of image-image VFL scenarios. Our comprehensive evaluation of cutting-edge VFL algorithms provides valuable insights for future research in the field.