Fine-tuning large language models (LLMs) requires significant memory, often exceeding the capacity of a single GPU. A common solution to this memory challenge is offloading compute and data from the GPU to the CPU. However, this approach is hampered by the limited bandwidth of commodity hardware, which constrains communication between the CPU and GPU. In this paper, we present an offloading framework, LSP_Offload, that enables near-native speed LLM fine-tuning on commodity hardware through learned subspace projectors. Our data-driven approach involves learning an efficient sparse compressor that minimizes communication with minimal precision loss. Additionally, we introduce a novel layer-wise communication schedule to maximize parallelism between communication and computation. As a result, our framework can fine-tune a 1.3 billion parameter model on a 4GB laptop GPU and a 7 billion parameter model on an NVIDIA RTX 4090 GPU with 24GB memory, achieving only a 31% slowdown compared to fine-tuning with unlimited memory. Compared to state-of-the-art offloading frameworks, our approach increases fine-tuning throughput by up to 3.33 times and reduces end-to-end fine-tuning time by 33.1%–62.5% when converging to the same accuracy.
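The core idea behind subspace-projector compression can be sketched in a few lines: project a gradient matrix onto a pair of low-dimensional subspaces before it crosses the slow CPU-GPU link, and reconstruct it on the other side. The toy numpy example below is a hypothetical illustration only (the shapes, names, and the use of an SVD in place of the paper's learned projectors are all assumptions, not the actual LSP_Offload implementation):

```python
import numpy as np

# Hypothetical sketch: compress a gradient matrix G by projecting it onto
# low-dimensional subspaces before the GPU -> CPU transfer, then reconstruct
# an approximation on the receiving side. Sizes are made up for illustration.
rng = np.random.default_rng(0)

m, n, r = 512, 512, 32                  # gradient shape and subspace rank
# Stand-in low-rank gradient; real gradients are only approximately low-rank.
G = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Orthonormal column/row subspace bases. Here they come from an SVD as a
# stand-in; in the paper these projectors are *learned* from data.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
P, Q = U[:, :r], Vt[:r, :].T            # P: (m, r), Q: (n, r)

compressed = P.T @ G @ Q                # (r, r) -- what crosses the PCIe bus
G_hat = P @ compressed @ Q.T            # approximate reconstruction

ratio = (m * n) / compressed.size       # communication volume reduction
err = np.linalg.norm(G - G_hat) / np.linalg.norm(G)
print(f"communication reduced {ratio:.0f}x, relative error {err:.2e}")
```

Because the stand-in gradient here is exactly rank `r`, the reconstruction error is near machine precision while communication drops by a factor of `(m*n)/r**2`; for real gradients the rank and projectors trade off precision loss against bandwidth savings.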