A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing Constraints

Federated Learning has gained attention for its ability to let multiple nodes collaboratively train machine learning models without sharing raw data. At the same time, Generative AI, particularly Generative Adversarial Networks (GANs), has achieved remarkable success across a wide range of domains, such as healthcare, security, and image generation. However, training generative models typically requires large datasets and significant computational resources, which are often unavailable in real-world settings. Acquiring such resources can be costly and inefficient, especially when many underutilized devices with varying capabilities, such as IoT and edge devices, remain idle. Moreover, obtaining large datasets is challenging because of privacy concerns and copyright restrictions, as most device owners are unwilling to share their data. To address these challenges, we propose a novel approach for decentralized GAN training that makes use of distributed data and underutilized, low-capability devices without sharing data in its raw form. Our approach combines KLD-weighted Clustered Federated Learning, which addresses data heterogeneity and multi-domain datasets, with Heterogeneous U-Shaped split learning, which addresses device heterogeneity under strict data sharing constraints: no labels or raw data, whether real or synthetic, are ever shared between nodes. Experiments show that our approach achieves an average 10% boost in classification metrics (up to 60% in multi-domain non-IID settings), 1.1x to 3x higher image generation scores for the MNIST family datasets, and 2x to 70x lower FID scores for higher-resolution datasets. Our code is available at https://distributed-gen-ai.github.io/huscf-gan.github.io/.
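
To make the idea of KLD-weighted aggregation concrete, the following is a minimal sketch, assuming that KLD denotes the Kullback-Leibler divergence and that each client in a cluster can be summarized by a discrete probability vector. The abstract does not specify what is compared or how the weights are derived, so the summary vectors, the softmax-of-negative-divergence weighting, and the helper names `kl_divergence`, `kld_weights`, and `weighted_average` are all illustrative assumptions, not the authors' formulation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def kld_weights(client_dists):
    """Hypothetical weighting rule (an assumption, not taken from the paper):
    clients whose summary distribution is closer to the cluster mean
    receive a larger aggregation weight."""
    n = len(client_dists)
    dim = len(client_dists[0])
    mean = [sum(d[j] for d in client_dists) / n for j in range(dim)]
    scores = [math.exp(-kl_divergence(d, mean)) for d in client_dists]
    total = sum(scores)
    return [s / total for s in scores]


def weighted_average(client_params, weights):
    """Weighted average of flat parameter vectors, one vector per client."""
    dim = len(client_params[0])
    return [
        sum(w * p[j] for w, p in zip(weights, client_params))
        for j in range(dim)
    ]


if __name__ == "__main__":
    # Three clients in one cluster; the third has a skewed summary distribution.
    dists = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
    params = [[1.0, 2.0], [1.2, 1.8], [5.0, -1.0]]
    w = kld_weights(dists)
    print("weights:", w)
    print("aggregate:", weighted_average(params, w))
```

In this sketch the skewed third client is down-weighted relative to the other two, so its parameters pull the aggregate less. A real system would exchange only model parameters and whatever privacy-preserving summaries the method permits, since the abstract states that no labels or raw data are shared between nodes.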
