Transformer-based large language models (LLMs) have demonstrated outstanding performance across diverse domains, particularly when fine-tuned for specific domains. Recent studies suggest that the resources required for fine-tuning LLMs can be reduced through parameter-efficient methods such as Low-Rank Adaptation (LoRA). While LoRA effectively reduces computational burdens and resource demands, it currently supports only a single-job fine-tuning setup. In this paper, we present ASPEN, a high-throughput framework for fine-tuning LLMs. ASPEN efficiently trains multiple jobs on a single GPU using the LoRA method, leveraging a shared pre-trained model and adaptive scheduling. ASPEN is compatible with transformer-based language models such as LLaMA and ChatGLM. Experiments show that ASPEN saves 53% of GPU memory when training multiple LLaMA-7B models on an NVIDIA A100 80GB GPU and boosts training throughput by about 17% compared to existing methods when training with various pre-trained models on different GPUs. The adaptive scheduling algorithm reduces turnaround time by 24% and end-to-end training latency by 12%, while prioritizing jobs and preventing out-of-memory issues.
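To illustrate the idea of several LoRA jobs sharing one pre-trained model, here is a minimal sketch in plain Python. It is not ASPEN's implementation: the class names (`SharedLinear`, `LoRAAdapter`), the job identifiers, and the tiny 2x2 weights are all hypothetical, chosen only to show that the frozen base weight is stored once while each job keeps its own small low-rank pair.

```python
# Illustrative sketch only: names, shapes, and values are hypothetical
# and are not taken from the ASPEN implementation.

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRAAdapter:
    """Per-job trainable low-rank pair; the update is scale * B @ A."""

    def __init__(self, a, b, scale=1.0):
        self.a = a          # shape r x d_in
        self.b = b          # shape d_out x r
        self.scale = scale

    def delta(self, x):
        # Apply A first, then B, so the full d_out x d_in update is never built.
        return [self.scale * v for v in matvec(self.b, matvec(self.a, x))]

    def num_params(self):
        return sum(len(r) for r in self.a) + sum(len(r) for r in self.b)


class SharedLinear:
    """One frozen base weight reused by every fine-tuning job."""

    def __init__(self, weight):
        self.weight = weight    # shape d_out x d_in, frozen and stored once
        self.adapters = {}      # job id -> LoRAAdapter

    def add_job(self, job_id, adapter):
        self.adapters[job_id] = adapter

    def forward(self, job_id, x):
        base = matvec(self.weight, x)               # uses the shared weights
        low_rank = self.adapters[job_id].delta(x)   # job-specific correction
        return [p + q for p, q in zip(base, low_rank)]

    def base_params(self):
        return sum(len(r) for r in self.weight)


layer = SharedLinear([[1.0, 0.0], [0.0, 1.0]])
layer.add_job("job_a", LoRAAdapter(a=[[1.0, 1.0]], b=[[0.5], [0.0]]))
layer.add_job("job_b", LoRAAdapter(a=[[0.0, 2.0]], b=[[0.0], [1.0]]))

out_a = layer.forward("job_a", [1.0, 2.0])
out_b = layer.forward("job_b", [1.0, 2.0])

# Parameter counts: one shared base versus one base copy per job.
adapter_params = sum(ad.num_params() for ad in layer.adapters.values())
shared_total = layer.base_params() + adapter_params
separate_total = len(layer.adapters) * layer.base_params() + adapter_params
```

In this toy setting the two jobs produce different outputs from the same input because only their adapters differ. At realistic sizes the base weight dominates the parameter count, which is why storing it once rather than once per job can reduce memory. The 53% figure reported above comes from the paper's experiments, not from this sketch.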