Federated fine-tuning of Mixture-of-Experts (MoE)-based large language models (LLMs) is challenging due to their massive computational requirements and the resource constraints of participants. Existing work attempts to fill this gap through model quantization, computation offloading, or expert pruning. However, these approaches fall short of the desired performance due to impractical system assumptions and a lack of consideration for MoE-specific characteristics. In this paper, we propose FLUX, a system designed to enable federated fine-tuning of MoE-based LLMs across participants with constrained computing resources (e.g., consumer-grade GPUs), aiming to minimize time-to-accuracy. FLUX introduces three key innovations: (1) quantization-based local profiling to estimate expert activation with minimal overhead, (2) adaptive layer-aware expert merging to reduce resource consumption while preserving accuracy, and (3) dynamic expert role assignment using an exploration-exploitation strategy to balance tuning and non-tuning experts. Extensive experiments on LLaMA-MoE and DeepSeek-MoE with multiple benchmark datasets demonstrate that FLUX significantly outperforms existing methods, achieving up to 4.75× speedup in time-to-accuracy.
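To make the third innovation concrete, the sketch below illustrates one plausible form of exploration-exploitation-based expert role assignment: with probability epsilon a round "explores" by tuning a randomly chosen expert, and otherwise "exploits" the experts with the highest estimated activation scores. This is a hypothetical epsilon-greedy sketch, not FLUX's actual algorithm; the function name, the `scores` input (expert id mapped to estimated activation frequency from profiling), and the parameter `k` (number of experts tuned per round) are illustrative assumptions.

```python
import random


def assign_expert_roles(scores, k, epsilon=0.2, rng=random):
    """Split experts into tuning and non-tuning (frozen) sets for one round.

    Hypothetical epsilon-greedy sketch: `scores` maps expert id -> estimated
    activation frequency (e.g., from local profiling); `k` experts are tuned.
    With probability `epsilon` an expert is picked at random (exploration),
    otherwise the highest-scoring remaining expert is picked (exploitation).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)  # best-scoring first
    tuned = []
    for _ in range(min(k, len(scores))):
        if rng.random() < epsilon:
            # Explore: give a random not-yet-selected expert a chance to be tuned.
            choice = rng.choice([e for e in scores if e not in tuned])
        else:
            # Exploit: tune the highest-scoring expert not yet selected.
            choice = next(e for e in ranked if e not in tuned)
        tuned.append(choice)
    frozen = [e for e in scores if e not in tuned]
    return tuned, frozen


# With epsilon=0.0 the assignment is purely greedy on the scores.
tuned, frozen = assign_expert_roles({0: 0.5, 1: 0.9, 2: 0.1, 3: 0.3}, k=2, epsilon=0.0)
```

Driving `epsilon` down over training rounds would shift the balance from exploration toward exploitation as activation estimates become more reliable.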