The machine learning community has witnessed impressive advances since the first appearance of large language models (LLMs), yet their enormous memory consumption has become a major roadblock to large-scale training. Parameter-efficient fine-tuning techniques such as Low-Rank Adaptation (LoRA) have been proposed to alleviate this problem, but their performance still fails to match full-parameter training in most large-scale fine-tuning settings. To remedy this deficiency, we investigate the layerwise properties of LoRA on fine-tuning tasks and observe an unusual skewness of weight norms across different layers. Building on this key observation, we discover a surprisingly simple training strategy that outperforms both LoRA and full-parameter training in a wide range of settings, with memory costs as low as LoRA's. We name it Layerwise Importance Sampled AdamW (LISA), a promising alternative to LoRA, which applies the idea of importance sampling to the layers of an LLM and randomly freezes most middle layers during optimization. Experimental results show that with similar or lower GPU memory consumption, LISA surpasses LoRA and even full-parameter tuning on downstream fine-tuning tasks, consistently outperforming LoRA by $11\%$-$37\%$ in MT-Bench scores. On large models, specifically LLaMA-2-70B, LISA achieves on-par or better performance than LoRA on MT-Bench, GSM8K, and PubMedQA, demonstrating its effectiveness across different domains.
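The core mechanism described above, sampling a few layers to train each period while freezing the rest, can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' implementation: `lisa_step_setup` and its `n_active` parameter are hypothetical names, and the toy stack of `nn.Linear` blocks stands in for an LLM's transformer layers.

```python
import random
import torch.nn as nn

def lisa_step_setup(model_layers, n_active, rng=random):
    """Uniformly sample `n_active` layers to train this period and
    freeze all others by toggling requires_grad.

    `model_layers` is any sequence of nn.Module blocks; in a real LLM
    the embedding and output head would typically stay trainable
    (an assumption about the setup, not shown here).
    """
    active = set(rng.sample(range(len(model_layers)), n_active))
    for i, layer in enumerate(model_layers):
        trainable = i in active
        for p in layer.parameters():
            p.requires_grad = trainable
    return active

# Toy demonstration: 8 small "layers", activate 2 at random per period.
layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(8)])
active = lisa_step_setup(layers, n_active=2, rng=random.Random(0))

trainable = sum(p.numel() for l in layers for p in l.parameters() if p.requires_grad)
total = sum(p.numel() for l in layers for p in l.parameters())
print(f"active layers: {sorted(active)}, trainable params: {trainable}/{total}")
```

In practice the sampling would be repeated every fixed number of optimizer steps, so that over the course of training every layer is updated some of the time while the per-step memory footprint stays close to that of training only a handful of layers.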