Federated learning enables training machine learning models on multiple decentralized local datasets without explicit data exchange. However, data pre-processing, including the handling of missing data, remains a major bottleneck in real-world federated learning deployments and is typically performed locally. Local pre-processing may be biased, since the subpopulation observed at each center may not be representative of the overall population. To address this issue, this paper first proposes a more consistent approach to data standardization through a federated model. We then propose Fed-MIWAE, a federated version of the state-of-the-art imputation method MIWAE, a deep latent variable model for missing data imputation based on variational autoencoders. MIWAE has the advantage of being easily trainable with classical federated aggregators. Furthermore, it can handle MAR (Missing At Random) data, in which the missingness of a variable can depend on the observed variables, a more challenging missing-data mechanism than MCAR (Missing Completely At Random). We evaluate our method on multi-modal medical imaging data and clinical scores in a simulated federated scenario built from the ADNI dataset. We compare Fed-MIWAE with classical imputation methods, performed either locally or in a centralized fashion. Fed-MIWAE achieves imputation accuracy comparable to that of the best centralized method, even when local data distributions are highly heterogeneous. In addition, thanks to its variational nature, Fed-MIWAE is designed to perform multiple imputation, allowing the imputation uncertainty to be quantified in the federated scenario.
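
The abstract does not spell out how the federated standardization is computed. As a minimal sketch of the general idea, and not the authors' method, the example below assumes each center shares only per-feature counts, sums and sums of squares over its observed entries, which a server pools into a global mean and standard deviation. The function names (`local_stats`, `aggregate`, `standardize`), the use of `None` to mark missing values, and the toy data are all hypothetical.

```python
import math

def local_stats(rows):
    """Per-feature count, sum and sum of squares over observed entries.

    Assumption: each row is a list of floats, with None marking a missing value.
    Only these aggregates leave the center, never the raw rows.
    """
    n_features = len(rows[0])
    count = [0] * n_features
    total = [0.0] * n_features
    total_sq = [0.0] * n_features
    for row in rows:
        for j, value in enumerate(row):
            if value is not None:
                count[j] += 1
                total[j] += value
                total_sq[j] += value * value
    return count, total, total_sq

def aggregate(stats_per_center):
    """Pool the local aggregates into a global mean and (population) std per feature."""
    n_features = len(stats_per_center[0][0])
    means, stds = [], []
    for j in range(n_features):
        n = sum(s[0][j] for s in stats_per_center)
        if n == 0:
            raise ValueError(f"feature {j} is never observed at any center")
        sx = sum(s[1][j] for s in stats_per_center)
        sxx = sum(s[2][j] for s in stats_per_center)
        mean = sx / n
        # Clamp at zero to guard against tiny negative values from rounding.
        var = max(sxx / n - mean * mean, 0.0)
        means.append(mean)
        stds.append(math.sqrt(var))
    return means, stds

def standardize(rows, means, stds):
    """Standardize observed entries with the global statistics; keep None as missing."""
    out = []
    for row in rows:
        new_row = []
        for v, m, s in zip(row, means, stds):
            if v is None:
                new_row.append(None)
            else:
                new_row.append((v - m) / s if s > 0 else 0.0)
        out.append(new_row)
    return out

# Toy data for two hypothetical centers with different local distributions.
center_a = [[1.0, 10.0], [2.0, None], [3.0, 30.0]]
center_b = [[10.0, 20.0], [None, 40.0]]

means, stds = aggregate([local_stats(center_a), local_stats(center_b)])
center_a_std = standardize(center_a, means, stds)
print(means, stds)
```

In this toy setup the mean of the first feature at `center_a` alone is 2.0, while the pooled mean is 4.0, which illustrates how purely local standardization can drift from the overall population when centers are heterogeneous. A real deployment would also need to consider numerical stability of the sum-of-squares formula and whether sharing these aggregates is acceptable under the privacy constraints in force.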