Vertical Federated Learning (VFL) is a crucial paradigm for training machine learning models on feature-partitioned, distributed data. However, due to privacy restrictions, few public real-world VFL datasets exist for algorithm evaluation, and those that do cover only a limited range of feature distributions. Existing benchmarks therefore often resort to synthetic datasets, created by arbitrarily splitting the features of a global dataset, which capture only a subset of feature distributions and thus give an inadequate assessment of algorithm performance. This paper addresses these shortcomings by introducing two key factors that affect VFL performance, feature importance and feature correlation, and by proposing associated evaluation metrics and dataset splitting methods. Additionally, we introduce a real VFL dataset to address the lack of image-image VFL scenarios. Our comprehensive evaluation of cutting-edge VFL algorithms provides valuable insights for future research in the field.