Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, particularly in task generalization for both text and vision data. While fine-tuning these models can significantly enhance their performance on specific downstream tasks, it often requires high-quality data that cannot be shared due to privacy concerns. Federated Learning (FL) offers a promising solution for collaborative training without direct data sharing. However, many parameter-efficient fine-tuning strategies for LLMs in FL, particularly those based on Low-Rank Adaptation (LoRA), face limitations. In this paper, we critically analyze the convergence and performance guarantees of popular FL frameworks built on LoRA, showing that LoRA is suboptimal because its low-rank matrices restrict learning to a constrained subspace. This limitation hinders effective fine-tuning of LLMs in federated settings. Through rigorous analytical and empirical evaluations, we demonstrate that direct weight averaging outperforms LoRA-based strategies, yielding superior performance for fine-tuned models. Our comprehensive comparison exposes inefficiencies in LoRA-based approaches and underscores the advantages of full-rank weight aggregation. We extend our analysis to low-rank gradient-based optimizers, such as GaLore, used during local training steps, and find that GaLore is a more effective alternative, outperforming federated LoRA methods such as FlexLoRA and FFA-LoRA across both text and image modalities. While privacy remains paramount in FL discourse, our focus here is on the performance of federated fine-tuned models, which we evaluate across FL frameworks from both theoretical and empirical perspectives. Our findings advocate reassessing the reliance on LoRA in federated settings, paving the way for more efficient training methodologies.