Machine learning typically assumes that training and test data follow the same distribution and that the data is stored centrally. In real-world scenarios, however, distributions may differ significantly, and data is often spread across different devices, organizations, or edge nodes. It is therefore imperative to develop models that generalize effectively to unseen distributions when the training data is distributed across different domains. In response to this challenge, interest in federated domain generalization (FDG) has surged in recent years. FDG combines the strengths of federated learning (FL) and domain generalization (DG) to enable multiple source domains to collaboratively learn a model that generalizes directly to unseen domains while preserving data privacy. However, generalizing a federated model under domain shift is technically challenging and has received scant attention so far. This paper presents the first survey of recent advances in this area. First, we trace the development from traditional machine learning through domain adaptation and domain generalization to FDG, and provide the corresponding formal definition. Then, we categorize recent methodologies into four classes: federated domain alignment, data manipulation, learning strategies, and aggregation optimization, and present representative algorithms for each category in detail. Next, we introduce commonly used datasets, applications, evaluations, and benchmarks. Finally, we conclude the survey by outlining potential topics for future research.