As Low-Rank Adaptation (LoRA) becomes the standard approach for efficiently fine-tuning large language models (LLMs), shared clusters increasingly execute many concurrent LoRA training jobs over the same frozen backbone. While recent advances enable batching (co-locating) multiple adapters during serving, efficient training-time co-location of heterogeneous LoRA adapters presents unique challenges. Jobs often differ in adapter rank, batch size, and resource allocation, and naïve batching can introduce synchronization stalls, communication overheads, and per-job slowdowns worse than running each job independently. We introduce tLoRA, a framework for efficient batch training of multiple LoRA jobs. tLoRA fuses adapters that share the same base model into an elastic shared super-model, leveraging existing distributed training frameworks to derive parallelism plans that share resources effectively. At the kernel level, tLoRA employs a fused LoRA kernel that adaptively reconstructs low-rank computation tiles and schedules rank-aware nano-batches to maximize overlap between computation and communication across adapters. At the scheduling layer, tLoRA incorporates an online, residual-capacity-aware scheduler that adaptively groups jobs to maximize collective throughput. Evaluations on real-world cluster traces show that tLoRA improves training throughput by 1.2--1.8x, reduces job completion time by 2.3--5.4x, and raises GPU utilization by 37%.
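To make the fusion idea concrete, the following is a minimal numpy sketch (not tLoRA's actual kernel) of batching LoRA adapters with heterogeneous ranks over one shared frozen base weight: each job's adapter is zero-padded to the maximum rank so a single batched computation covers all jobs, and the result matches running each job independently. All names, shapes, and the padding scheme here are illustrative assumptions; the padded FLOPs are exactly the waste a rank-aware kernel would avoid.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 6

# Frozen base weight, shared by every co-located job (assumed shapes).
W = rng.normal(size=(d_out, d_in))

# Two jobs with heterogeneous adapter ranks: A_j is (r_j, d_in), B_j is (d_out, r_j).
ranks = [2, 4]
jobs = [(rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))) for r in ranks]

# One input sample per job, batched together.
x = rng.normal(size=(len(jobs), d_in))

# Independent execution: each job applies the base plus its own low-rank update.
y_indep = np.stack([x[j] @ (W + B @ A).T for j, (A, B) in enumerate(jobs)])

# Fused execution: zero-pad every adapter to the max rank so one batched
# matmul covers all jobs. Padded rows/columns contribute zero, so results
# are unchanged; the extra FLOPs are the cost a rank-aware kernel avoids.
r_max = max(ranks)
A_pad = np.stack([np.pad(A, ((0, r_max - A.shape[0]), (0, 0))) for A, _ in jobs])
B_pad = np.stack([np.pad(B, ((0, 0), (0, r_max - B.shape[1]))) for _, B in jobs])

h = np.einsum('jri,ji->jr', A_pad, x)            # per-job low-rank projection
y_fused = x @ W.T + np.einsum('jor,jr->jo', B_pad, h)

assert np.allclose(y_indep, y_fused)
```

The padding makes the fused path numerically identical to independent execution; the scheduling question tLoRA addresses is how to group jobs so this shared computation pays off rather than stalling on rank or batch-size mismatches.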