Vertical Federated Learning (VFL) is a crucial paradigm for training machine learning models on feature-partitioned, distributed data. However, due to privacy restrictions, few public real-world VFL datasets exist for algorithm evaluation, and those that exist cover only a limited range of feature distributions. Existing benchmarks often resort to synthetic datasets, created by arbitrarily splitting the features of a global dataset, which capture only a subset of feature distributions and therefore lead to inadequate assessment of algorithm performance. This paper addresses these shortcomings by introducing two key factors that affect VFL performance, feature importance and feature correlation, and by proposing associated evaluation metrics and dataset splitting methods. Additionally, we introduce a real VFL dataset to address the deficit in image-image VFL scenarios. Our comprehensive evaluation of cutting-edge VFL algorithms provides valuable insights for future research in the field.