Recent Mixture-of-Experts (MoE)-based large language models (LLMs) such as Qwen-MoE and DeepSeek-MoE are transforming generative AI in natural language processing. However, these models require vast and diverse training data. Federated learning (FL) addresses this challenge by leveraging private data from heterogeneous edge devices for privacy-preserving MoE training. Nonetheless, traditional FL approaches require devices to host local MoE models, which is impractical for resource-constrained devices due to large model sizes. To address this, we propose DeepFusion, the first scalable federated MoE training framework that fuses heterogeneous on-device LLM knowledge via federated knowledge distillation, yielding a knowledge-rich global MoE model. Specifically, DeepFusion allows each device to independently configure and train an on-device LLM tailored to its own needs and hardware limitations. Furthermore, we propose a novel View-Aligned Attention (VAA) module that integrates multi-stage feature representations from the global MoE model to construct a predictive perspective aligned with on-device LLMs, thereby enabling effective cross-architecture knowledge distillation. By explicitly aligning predictive perspectives, VAA resolves the view-mismatch problem in traditional federated knowledge distillation, which arises from heterogeneity in model architectures and prediction behaviors between on-device LLMs and the global MoE model. Experiments with industry-level MoE models (Qwen-MoE and DeepSeek-MoE) and real-world datasets (medical and finance) demonstrate that DeepFusion achieves performance close to centralized MoE training. Compared with key federated MoE baselines, DeepFusion reduces communication costs by up to 71% and improves token perplexity by up to 5.28%.