Federated fine-tuning of Mixture-of-Experts (MoE)-based large language models (LLMs) is challenging due to their massive computational requirements and the resource constraints of participants. Existing work attempts to fill this gap through model quantization, computation offloading, or expert pruning. However, these approaches fall short of the desired performance due to impractical system assumptions and a lack of consideration for MoE-specific characteristics. In this paper, we propose FLUX, a system designed to enable federated fine-tuning of MoE-based LLMs across participants with constrained computing resources (e.g., consumer-grade GPUs), aiming to minimize time-to-accuracy. FLUX introduces three key innovations: (1) quantization-based local profiling to estimate expert activation with minimal overhead, (2) adaptive layer-aware expert merging to reduce resource consumption while preserving accuracy, and (3) dynamic expert role assignment using an exploration-exploitation strategy to balance tuning and non-tuning experts. Extensive experiments on LLaMA-MoE and DeepSeek-MoE with multiple benchmark datasets demonstrate that FLUX significantly outperforms existing methods, achieving up to 4.75× speedup in time-to-accuracy.
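To make the third innovation concrete, the sketch below illustrates one plausible form of exploration-exploitation-based expert role assignment: with probability epsilon a round "explores" by tuning a randomly chosen expert, and otherwise "exploits" the experts with the highest estimated activation scores. This is a hypothetical epsilon-greedy sketch, not FLUX's actual algorithm; the function name, the `scores` input (expert id mapped to estimated activation frequency from profiling), and the parameter `k` (number of experts tuned per round) are illustrative assumptions.

```python
import random


def assign_expert_roles(scores, k, epsilon=0.2, rng=random):
    """Split experts into tuning and non-tuning (frozen) sets for one round.

    Hypothetical epsilon-greedy sketch: `scores` maps expert id -> estimated
    activation frequency (e.g., from local profiling); `k` experts are tuned.
    With probability `epsilon` an expert is picked at random (exploration),
    otherwise the highest-scoring remaining expert is picked (exploitation).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)  # best-scoring first
    tuned = []
    for _ in range(min(k, len(scores))):
        if rng.random() < epsilon:
            # Explore: give a random not-yet-selected expert a chance to be tuned.
            choice = rng.choice([e for e in scores if e not in tuned])
        else:
            # Exploit: tune the highest-scoring expert not yet selected.
            choice = next(e for e in ranked if e not in tuned)
        tuned.append(choice)
    frozen = [e for e in scores if e not in tuned]
    return tuned, frozen


# With epsilon=0.0 the assignment is purely greedy on the scores.
tuned, frozen = assign_expert_roles({0: 0.5, 1: 0.9, 2: 0.1, 3: 0.3}, k=2, epsilon=0.0)
```

Driving `epsilon` down over training rounds would shift the balance from exploration toward exploitation as activation estimates become more reliable.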