The machine learning community has witnessed impressive advancements since large language models (LLMs) first appeared. Yet, their massive memory consumption has become a significant roadblock to large-scale training. For instance, a 7B model typically requires at least 60 GB of GPU memory for full-parameter training, which presents challenges for researchers without access to high-resource environments. Parameter-Efficient Fine-Tuning (PEFT) techniques such as Low-Rank Adaptation (LoRA) have been proposed to alleviate this problem. However, in most large-scale fine-tuning settings, their performance does not reach the level of full-parameter training, because they confine the parameter search to a low-rank subspace. To remedy this deficiency, we investigate the layerwise properties of LoRA on fine-tuning tasks and observe an unexpected but consistent skewness of weight norms across different layers. Building on this key observation, we discover a surprisingly simple training strategy that outperforms both LoRA and full-parameter training in a wide range of settings, with memory costs as low as LoRA's. We name it Layerwise Importance Sampled AdamW (LISA), a promising alternative to LoRA, which applies the idea of importance sampling to the layers of an LLM and randomly freezes most middle layers during optimization. Experimental results show that with similar or lower GPU memory consumption, LISA surpasses LoRA and even full-parameter tuning in downstream fine-tuning tasks: LISA consistently outperforms LoRA by 10%-35% in terms of MT-Bench score while achieving on-par or better performance on MMLU, AGIEval, and WinoGrande. On large models, specifically LLaMA-2-70B, LISA surpasses LoRA on MT-Bench, GSM8K, and PubMedQA, demonstrating its effectiveness across different domains.
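The core mechanism described above (randomly unfreezing a small set of middle layers each optimization period while keeping the rest frozen) can be sketched in a few lines. The following is a minimal pure-Python illustration, not the authors' implementation; the function name, the choice to always keep the first and last layers trainable, and uniform sampling of the middle layers are assumptions made for this sketch.

```python
import random

def lisa_layer_mask(num_layers: int, num_active: int, rng: random.Random) -> list:
    """Return a boolean trainable-mask over a stack of layers.

    Sketch of LISA-style layerwise sampling (assumptions: the first and
    last layers stay trainable, and `num_active` middle layers are drawn
    uniformly at random; all other layers are frozen this period).
    """
    mask = [False] * num_layers
    mask[0] = True              # e.g. embedding-side layer stays trainable (assumed)
    mask[-1] = True             # e.g. head-side layer stays trainable (assumed)
    middle = list(range(1, num_layers - 1))
    for idx in rng.sample(middle, num_active):
        mask[idx] = True        # unfreeze a sampled middle layer
    return mask
```

In a training loop, such a mask would be resampled every fixed number of steps and applied by toggling each layer's `requires_grad` flags, so that optimizer state (and hence memory) is only needed for the small active subset.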