We propose LESS-VFL, a communication-efficient feature selection method for distributed systems with vertically partitioned data. We consider a system of a server and several parties with local datasets that share a sample ID space but have different feature sets. The parties wish to collaboratively train a model for a prediction task. As part of the training, the parties wish to remove unimportant features in the system to improve generalization, efficiency, and explainability. In LESS-VFL, after a short pre-training period, the server optimizes its part of the global model to determine the relevant outputs from party models. This information is shared with the parties to then allow local feature selection without communication. We analytically prove that LESS-VFL removes spurious features from model training. We provide extensive empirical evidence that LESS-VFL can achieve high accuracy and remove spurious features at a fraction of the communication cost of other feature selection approaches.
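The two-stage flow described above (the server identifies the relevant party-model outputs, then each party selects features locally) can be illustrated with a minimal sketch. This is not the LESS-VFL algorithm itself: the abstract does not specify the optimization used, so the sketch assumes the server and party models each expose a first-layer weight matrix that has already been sparsified, and it uses a simple column-norm threshold as a stand-in for the actual selection criterion. All function names, the `tol` parameter, and the example weights are hypothetical.

```python
import math


def column_norms(weights):
    """Return the L2 norm of each column of a matrix given as a list of rows."""
    n_cols = len(weights[0])
    return [math.sqrt(sum(row[j] ** 2 for row in weights)) for j in range(n_cols)]


def server_relevant_outputs(server_weights, tol=1e-6):
    """Server step (assumed): find which party-model outputs the server still uses.

    server_weights[i][j] is the weight from party output j to server unit i.
    An output whose whole column is (near) zero is treated as irrelevant.
    """
    norms = column_norms(server_weights)
    return [j for j, norm in enumerate(norms) if norm > tol]


def party_select_features(party_weights, relevant_outputs, tol=1e-6):
    """Party step (assumed): local feature selection, no further communication.

    party_weights[k][f] is the weight from input feature f to party output k.
    A feature is kept only if it feeds at least one relevant output.
    """
    kept_rows = [party_weights[k] for k in relevant_outputs]
    if not kept_rows:
        return []
    norms = column_norms(kept_rows)
    return [f for f, norm in enumerate(norms) if norm > tol]


# Hypothetical example: one party with 4 input features and 3 outputs.
example_server_weights = [
    [0.5, 0.0, -1.2],
    [0.3, 0.0, 0.7],
]
example_party_weights = [
    [0.9, 0.0, 0.0, 0.2],   # output 0
    [0.0, 0.8, 0.0, 0.0],   # output 1 (ignored by the server)
    [0.0, 0.0, 0.0, -0.4],  # output 2
]

relevant = server_relevant_outputs(example_server_weights)
selected = party_select_features(example_party_weights, relevant)
print("relevant outputs:", relevant)
print("selected features:", selected)
```

In this toy example the only message sent from the server to the party is the list of relevant output indices; the party then drops feature 1 (which feeds only an ignored output) and feature 2 (which feeds nothing) without exchanging any further data.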