Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, particularly in task generalization for both text and vision data. While fine-tuning these models can significantly enhance their performance on specific downstream tasks, it often requires high-quality data that cannot be shared due to privacy concerns. Federated Learning (FL) offers a promising solution for collaborative training without direct data sharing. However, many parameter-efficient fine-tuning strategies for LLMs in FL, particularly those based on Low-Rank Adaptation (LoRA), face limitations. In this paper, we critically analyze the convergence and performance guarantees of popular FL frameworks utilizing LoRA, highlighting its suboptimal nature due to constrained subspace learning of low-rank matrices. This limitation hinders effective fine-tuning of LLMs in federated settings. Through rigorous analytical and empirical evaluations, we demonstrate that direct weight averaging outperforms LoRA-based strategies, leading to superior performance for fine-tuned models. Our comprehensive comparison exposes inefficiencies in LoRA approaches and underscores the advantages of direct weight aggregation. We extend our analysis to low-rank gradient-based optimizers, such as GaLore, used during local training steps. Our findings show that GaLore is a more effective alternative, outperforming federated LoRA methods like FlexLoRA and FFA-LoRA across both text and image modalities. While privacy remains paramount in FL discourse, our focus is on assessing performance outcomes of federated fine-tuned models and evaluating various FL frameworks from both theoretical and empirical perspectives. Our findings advocate reassessing the reliance on LoRA within FL contexts, paving the way for more efficient training methodologies.