Vertical Federated Learning (VFL) enables multiple data owners, each holding a different subset of features about largely overlapping sets of data samples, to jointly train a useful global model. Feature selection (FS) is important to VFL, yet it remains an open research problem: existing FS methods designed for VFL assume prior knowledge of either the number of noisy features or the post-training threshold for selecting useful features, which makes them unsuitable for practical applications. To bridge this gap, we propose the Federated Stochastic Dual-Gate based Feature Selection (FedSDG-FS) approach. It uses a Gaussian stochastic dual-gate to efficiently approximate the probability of a feature being selected, with privacy protected through Partially Homomorphic Encryption and without a trusted third party. To reduce overhead, we propose a feature importance initialization method based on Gini impurity, which requires only two parameter transmissions between the server and the clients. Extensive experiments on both synthetic and real-world datasets show that FedSDG-FS significantly outperforms existing approaches both in selecting high-quality features accurately and in building global models with improved performance.
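To make two of the ingredients named above concrete, the following is a minimal plaintext sketch, not the FedSDG-FS protocol itself. It assumes one common formulation of a Gaussian stochastic gate, in which a gate is open when mu + eps > 0 with eps drawn from N(0, sigma^2), and it scores a feature by the weighted Gini impurity of a single threshold split. The function names, the threshold split, and the default sigma are illustrative assumptions. The abstract does not specify the dual-gate construction, the encryption steps, or how Gini scores are mapped to initial gate parameters, so none of those are modeled here.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split_score(feature, labels, threshold):
    """Weighted Gini impurity after splitting one feature at `threshold`.

    Lower means the feature separates the classes better.
    The single-threshold split is an assumption made for illustration.
    """
    left = [y for x, y in zip(feature, labels) if x <= threshold]
    right = [y for x, y in zip(feature, labels) if x > threshold]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def gate_open_probability(mu, sigma=0.5):
    """P(mu + eps > 0) for eps ~ N(0, sigma^2): the standard normal CDF at mu / sigma.

    Assumed gate formulation; sigma=0.5 is an arbitrary illustrative default.
    """
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))


labels = [0, 0, 1, 1]
informative = [0.1, 0.2, 0.8, 0.9]  # separates the two classes at 0.5
noisy = [0.1, 0.9, 0.2, 0.8]        # unrelated to the labels

informative_score = gini_split_score(informative, labels, threshold=0.5)
noisy_score = gini_split_score(noisy, labels, threshold=0.5)
print(informative_score, noisy_score)
print(gate_open_probability(0.0), gate_open_probability(0.5))
```

In this toy data the informative feature yields pure partitions and so a lower impurity than the noisy one, while a gate with mu = 0 is open with probability one half and the probability grows as mu increases.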