Federated Learning (FL) is an emerging machine learning paradigm that enables multiple parties to collaboratively train models without sharing raw data, thereby preserving data privacy. In Vertical FL (VFL), where each party holds different features for the same set of users, a key challenge is evaluating each party's feature contribution before any model has been trained. To address this, the Shapley-CMI method was recently proposed as a model-free, information-theoretic approach to feature valuation based on Conditional Mutual Information (CMI). However, its original formulation lacked a practical implementation capable of computing the required permutations and intersections securely. This paper presents a privacy-preserving implementation of Shapley-CMI for VFL. Our system introduces a private set intersection (PSI) server that performs all necessary feature permutations and computes encrypted intersection sizes over discretized and encrypted ID groups, without exchanging raw data. Each party then uses these intersection results to compute Shapley-CMI values that quantify the marginal utility of its features. Initial experiments confirm the correctness and privacy of the proposed system, demonstrating its viability for secure and efficient feature contribution estimation in VFL. The approach preserves data confidentiality, scales across multiple parties, and enables fair data valuation without sharing raw data or training models.
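For orientation, the Shapley-CMI valuation follows the standard Shapley value formulation, with the coalition utility instantiated as a CMI-based quantity; the sketch below is a generic form, where the exact definition of the utility $v(S)$ (here hedged as an intersection-based estimate of a mutual-information term) is an assumption rather than a statement of the method's precise formula. For $n$ parties $N = \{1, \dots, n\}$, party $i$'s value is

\[
\phi_i \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!}\,\bigl[v(S \cup \{i\}) - v(S)\bigr],
\]

where $v(S)$ denotes the (conditional) mutual-information utility of the feature coalition held by the parties in $S$, estimated in our setting from the encrypted intersection sizes of discretized ID groups returned by the PSI server.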