Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. Before training, however, data must be preprocessed to handle missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance yet largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, while communication costs further complicate distributed preprocessing. We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value imputation, and extend preprocessing-related models such as k-Means, k-Nearest Neighbors, and Bayesian Linear Regression to both horizontal and vertical FL settings. FedPS provides flexible, communication-efficient, and consistent preprocessing pipelines for practical FL deployments.
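To make the aggregated-statistics idea concrete, the following is a minimal illustrative sketch (not FedPS's actual API) of federated standard scaling in the horizontal setting: each client shares only per-feature counts, sums, and sums of squares, and the server combines them into global mean and standard deviation, which clients then apply locally. All function names here are hypothetical.

```python
# Hypothetical sketch: federated standard scaling via aggregated statistics.
# Clients share only (count, sum, sum of squares) per feature; no raw rows
# leave any client.
import numpy as np

def local_summary(X):
    """Client side: summarize a local data partition."""
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

def aggregate(summaries):
    """Server side: combine client summaries into global mean and std."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    sq_total = sum(s[2] for s in summaries)
    mean = total / n
    var = sq_total / n - mean ** 2  # E[X^2] - E[X]^2
    return mean, np.sqrt(var)

# Two clients holding disjoint horizontal partitions of the same features.
a = np.array([[1.0, 10.0], [2.0, 20.0]])
b = np.array([[3.0, 30.0], [4.0, 40.0]])
mean, std = aggregate([local_summary(a), local_summary(b)])
scaled_a = (a - mean) / std  # each client scales locally with global stats
```

The resulting mean and std match what centralized scaling over the pooled data would compute, which is the consistency property the framework targets; richer statistics (quantiles, frequency counts) would be summarized with sketches rather than exact sums.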