Large Language Models (LLMs) have recently gained popularity due to their impressive few-shot performance across various downstream tasks. However, fine-tuning all parameters and storing a unique model for each downstream task or domain becomes impractical because of the massive size of checkpoints (e.g., 350GB for GPT-3). Recent work such as LoRA shows that low-rank modifications to the original weights of an LLM enable efficient adaptation and storage of task-specific models, reducing the number of parameters needed to fine-tune an LLM by several orders of magnitude. Yet these methods face two primary limitations: 1) the parameter reduction is lower-bounded by the rank-one decomposition, and 2) the extent of reduction is heavily influenced by both the model architecture and the chosen rank. For instance, in larger models, even a rank-one decomposition might use more parameters than adaptation truly needs. In this paper, we introduce NOLA, which overcomes the rank-one lower bound present in LoRA. It achieves this by re-parameterizing the low-rank matrices in LoRA as linear combinations of randomly generated basis matrices and optimizing only the linear mixture coefficients. This approach decouples the number of trainable parameters from both the choice of rank and the network architecture. We present adaptation results using GPT-2 and ViT on natural language and computer vision tasks. NOLA performs as well as, or better than, models with equivalent parameter counts. Furthermore, we demonstrate that in larger models NOLA can halve the number of parameters compared to LoRA with rank one, without sacrificing performance.
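The re-parameterization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian basis, the seed handling, the separate coefficient vectors `alpha` and `beta`, and all function names are assumptions made for illustration. The sketch builds the two low-rank factors as mixtures of fixed random basis matrices, so the only trainable values are the mixture coefficients, and their count (`k + l`) depends on neither the rank `r` nor the layer shape `m x n`.

```python
import random

def random_basis(count, rows, cols, seed):
    """Generate `count` fixed random matrices from a seed (never trained)."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
            for _ in range(count)]

def mix(coeffs, basis):
    """Return sum_i coeffs[i] * basis[i] as a single matrix."""
    rows, cols = len(basis[0]), len(basis[0][0])
    out = [[0.0] * cols for _ in range(rows)]
    for c, mat in zip(coeffs, basis):
        for i in range(rows):
            for j in range(cols):
                out[i][j] += c * mat[i][j]
    return out

def matmul(a, b):
    """Plain matrix product of a (p x q) and b (q x s)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def nola_delta(alpha, beta, basis_a, basis_b):
    """Weight update A @ B, where A and B are mixtures of random bases."""
    a = mix(alpha, basis_a)   # m x r factor
    b = mix(beta, basis_b)    # r x n factor
    return matmul(a, b)       # m x n update with rank at most r

def lora_param_count(m, n, r):
    """Trainable parameters of a LoRA update: two factors, m x r and r x n."""
    return r * (m + n)

def nola_param_count(k, l):
    """Trainable parameters in this sketch: only the mixture coefficients."""
    return k + l

# Hypothetical sizes chosen for illustration only.
m, n, r, k, l = 8, 6, 2, 3, 3
basis_a = random_basis(k, m, r, seed=0)   # regenerated from the seed, not stored
basis_b = random_basis(l, r, n, seed=1)
alpha = [0.1, -0.2, 0.3]                  # trainable
beta = [0.5, 0.0, -0.1]                   # trainable
delta = nola_delta(alpha, beta, basis_a, basis_b)
```

With these illustrative sizes, even a rank-one LoRA update would train `lora_param_count(8, 6, 1) = 14` values, while the sketch trains `nola_param_count(3, 3) = 6`. Because the basis can be regenerated from its seed, only the coefficients (and the seed) would need to be stored per task under these assumptions.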