Transformer-based, pre-trained large language models (LLMs) have demonstrated outstanding performance across diverse domains, particularly in the emerging {\em pretrain-then-finetune} paradigm. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is commonly used to adapt a base LLM to multiple downstream tasks. Moreover, LLM platforms enable developers to fine-tune multiple models and develop various domain-specific applications simultaneously. However, existing model parallelism schemes suffer from high communication overhead and inefficient GPU utilization when training multiple LoRA tasks across GPUs and machines. In this paper, we present mLoRA, a parallelism-efficient fine-tuning system designed for training multiple LoRA adapters across GPUs and machines. mLoRA introduces a novel LoRA-aware pipeline parallelism scheme that efficiently pipelines independent LoRA adapters and their distinct fine-tuning stages across GPUs and machines, along with a new LoRA-efficient operator to enhance GPU utilization during pipelined LoRA training. Our extensive evaluation shows that mLoRA significantly reduces average fine-tuning task completion time, e.g., by 30\%, compared to state-of-the-art methods like FSDP. More importantly, mLoRA enables simultaneous fine-tuning of larger models, e.g., two Llama-2-13B models on four NVIDIA RTX A6000 48GB GPUs, which is not feasible for FSDP due to its high memory requirements. Hence, mLoRA not only increases fine-tuning efficiency but also makes it more accessible on cost-effective GPUs. mLoRA has been deployed in AntGroup's production environment.
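To make the LoRA technique concrete, the following is a minimal NumPy sketch of a LoRA-adapted linear layer (illustrative only; class and parameter names are our own, not mLoRA's API). The frozen base weight $W$ is shared, while only the small low-rank factors $A$ and $B$ are trained per task, which is why many adapters can be fine-tuned against one base model at once.

```python
import numpy as np

class LoRALinear:
    """A linear layer with a frozen base weight plus a trainable
    low-rank update: y = x W^T + (x A^T) B^T * (alpha / r)."""

    def __init__(self, d_in, d_out, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
        self.A = rng.standard_normal((r, d_in)) * 0.01      # trainable, rank r
        self.B = np.zeros((d_out, r))                        # trainable, zero-init
        self.scale = alpha / r

    def forward(self, x):
        # Base path plus the scaled low-rank adapter path.
        return x @ self.W.T + (x @ self.A.T) @ self.B.T * self.scale

layer = LoRALinear(d_in=64, d_out=64)
x = np.ones((2, 64))
y = layer.forward(x)
# With B zero-initialized, the adapter contributes nothing at step 0,
# so the output equals the frozen base layer's output.
assert np.allclose(y, x @ layer.W.T)
```

Because the base weight never receives gradients, each downstream task only stores and updates its own $A$ and $B$, keeping the per-task memory footprint small relative to full fine-tuning.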