Vertical federated learning (VFL) enables multiple parties with disjoint features of a common user set to train a machine learning model without sharing their private data. Tree-based models have become prevalent in VFL due to their interpretability and efficiency. However, the vulnerability of tree-based VFL has not been sufficiently investigated. In this study, we first introduce a novel label inference attack, ID2Graph, which uses the sets of record IDs assigned to each node (i.e., the instance space) to deduce private training labels. The ID2Graph attack generates a graph structure from training samples, extracts communities from the graph, and clusters the local dataset using community information. To counteract label leakage from the instance space, we propose two effective defense mechanisms: Grafting-LDP, which improves the utility of label differential privacy with post-processing, and ID-LMID, which focuses on mutual information regularization. Comprehensive experiments on various datasets reveal that ID2Graph presents significant risks to tree-based models such as RandomForest and XGBoost. Further evaluations on these benchmarks demonstrate that our defense methods effectively mitigate label leakage in such instances.
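The first two steps of the attack described above (building a graph from the instance space, then extracting communities) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify how edges are weighted or which community detection algorithm is used, so this sketch weights an edge by how many tree nodes two record IDs share and substitutes connected components over thresholded edges for real community detection. The function names, the `min_weight` parameter, and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(instance_spaces):
    """Count how often each pair of record IDs shares a tree node.

    Assumption: co-occurrence counts serve as edge weights; the abstract
    does not state the actual weighting scheme.
    """
    weights = Counter()
    for ids in instance_spaces:
        for u, v in combinations(sorted(ids), 2):
            weights[(u, v)] += 1
    return weights


def extract_communities(record_ids, weights, min_weight=2):
    """Group records connected by edges of weight >= min_weight.

    This is a crude stand-in for community detection (union-find over
    thresholded edges), not the method used by ID2Graph.
    """
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (u, v), w in weights.items():
        if w >= min_weight:
            parent[find(u)] = find(v)

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), set()).add(r)
    return sorted(groups.values(), key=min)


# Hypothetical instance spaces: record-ID sets observed at nodes of several trees.
instance_spaces = [
    {0, 1, 2}, {3, 4, 5},
    {0, 1}, {2, 3, 4, 5},
    {0, 1, 2}, {3, 4}, {5},
]
weights = build_cooccurrence_graph(instance_spaces)
communities = extract_communities(range(6), weights)
print(communities)
```

In the attack as summarized above, community membership of each record would then be combined with the attacker's local features before clustering; that final step is omitted here because the abstract gives no detail on how the two are combined.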