Low-Rank Adaptation (LoRA) is now the dominant method for parameter-efficient fine-tuning of large language models, but achieving a high-quality adapter often requires systematic hyperparameter tuning because LoRA performance is highly sensitive to configuration choices. In practice, this leads to many concurrent LoRA jobs, often spanning heterogeneous tasks in multi-tenant environments. Existing systems largely handle these jobs independently, which both wastes computation on weak candidates and leaves GPUs underutilized. We present ALTO (Adaptive LoRA Tuning and Orchestration), a co-designed training system that accelerates LoRA hyperparameter tuning while enabling efficient cluster sharing across heterogeneous tasks. The central insight behind ALTO is that when multiple tuning jobs run concurrently over a shared frozen backbone, they expose optimization opportunities that single-job designs cannot exploit. Building on this insight, ALTO (i) monitors loss trajectories to terminate unpromising configurations early, (ii) uses fused grouped GEMM together with a new rank-local adapter parallelism to co-locate surviving adapters and reclaim freed GPU capacity, and (iii) combines intra-task and inter-task scheduling to improve multi-task placement by leveraging the predictable duration of LoRA jobs. Extensive evaluation shows that ALTO achieves up to $13.8\times$ speedup over the state of the art without sacrificing adapter quality.
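As a rough illustration of terminating unpromising configurations from their loss trajectories, the sketch below ranks concurrent runs by a smoothed recent loss and keeps only the best fraction. This is a generic successive-halving-style rule, not ALTO's actual criterion, which the abstract does not specify. The function names, the configuration labels, the loss values, and the `keep_fraction` and `window` parameters are all assumptions made for this example.

```python
from statistics import mean


def smoothed_loss(losses, window=3):
    """Mean of the last `window` recorded losses (fewer if the run is short).

    `losses` must be non-empty; statistics.mean raises on an empty list.
    """
    return mean(losses[-window:])


def select_survivors(trajectories, keep_fraction=0.5, window=3):
    """Return the configuration names to keep, best (lowest loss) first.

    trajectories: dict mapping a configuration name to the list of
    training losses it has logged so far. At least one run is always kept.
    """
    ranked = sorted(
        trajectories,
        key=lambda name: smoothed_loss(trajectories[name], window),
    )
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Hypothetical loss logs for four concurrent LoRA configurations.
trajectories = {
    "r8_lr1e-4": [2.10, 1.80, 1.62, 1.55],
    "r16_lr3e-4": [2.05, 1.60, 1.35, 1.21],
    "r64_lr1e-3": [2.20, 2.40, 2.90, 3.50],  # diverging
    "r4_lr1e-5": [2.15, 2.12, 2.10, 2.09],   # barely improving
}

survivors = select_survivors(trajectories, keep_fraction=0.5)
terminated = sorted(set(trajectories) - set(survivors))
print("keep:", survivors)
print("terminate:", terminated)
```

In this toy run the diverging and the stalled configurations are dropped, which is the point at which a system like the one described could co-locate the surviving adapters on the freed GPU capacity. A real criterion would likely also account for noise in the loss curve and for runs that start slowly but improve later.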