Low-Rank Adaptation (LoRA) has emerged as a dominant method in Parameter-Efficient Fine-Tuning (PEFT) for large language models; it augments each transformer layer with one down-projection matrix $A$ and one up-projection matrix $B$. However, LoRA's reliance on a single down-projection matrix $A$ creates a representational bottleneck: this solitary feature extractor is inherently insufficient for capturing the diverse signals required by complex tasks. This motivates our architectural shift toward enriching feature adaptation to improve downstream task performance. We propose MASA (Multi-$A$ Shared Adaptation), an architecture with a multi-$A$, single-$B$ structure in which the multi-$A$ expert ensemble is asymmetrically shared across layers to preserve parameter efficiency. In MASA, these specialized experts capture diverse features, which are then integrated by a single, layer-specific $B$-matrix. The effectiveness and versatility of our method are validated through a comprehensive suite of experiments spanning multi-domain generalization, single-domain specialization, and multi-task reasoning. For example, on the MMLU benchmark, MASA achieves an average accuracy of 59.62%, outperforming standard LoRA by 1.08 points (a relative improvement of 1.84%) at a comparable trainable-parameter ratio of 0.52%.
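The multi-$A$, single-$B$ structure described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the expert count, shapes, and the uniform averaging over expert outputs are assumptions (the abstract does not specify how expert features are combined), and only the core idea is kept, namely that one set of $A$ matrices is shared across layers while each layer keeps its own $B$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts, n_layers = 16, 4, 3, 2  # illustrative sizes

# Multi-A expert ensemble: ONE set of down-projections shared by all layers,
# so the A-side parameter cost does not grow with depth.
A_shared = [rng.normal(scale=0.01, size=(r, d)) for _ in range(n_experts)]

# Layer-specific up-projections, zero-initialized as in standard LoRA so the
# adapted model starts identical to the frozen base model.
B_layers = [np.zeros((d, r)) for _ in range(n_layers)]


def masa_delta(x, layer):
    """LoRA-style update for one layer: the shared experts extract diverse
    rank-r features, and the layer's single B matrix integrates them.
    Uniform averaging over experts is an illustrative assumption."""
    feats = np.mean([A @ x for A in A_shared], axis=0)  # shape (r,)
    return B_layers[layer] @ feats                      # shape (d,)


x = rng.normal(size=d)
delta = masa_delta(x, 0)  # zero at initialization, since B starts at zero
```

Under this sketch, the trainable budget is $E \cdot r \cdot d$ for the shared experts plus $L \cdot d \cdot r$ for the per-layer $B$ matrices, versus $L \cdot 2 \cdot d \cdot r$ for standard LoRA, which is how sharing keeps the parameter counts comparable.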