Fine-tuning large language models is a popular choice among users who want to adapt them to specific applications. However, fine-tuning these models is a demanding task because the user must weigh several factors, such as resource budget, runtime, model size, and context length. A particular challenge is that fine-tuning is memory intensive, imposing constraints on the hardware memory required and on the context length of the training data that can be handled. In this work, we present a detailed study of a variety of fine-tuning optimizations across different fine-tuning scenarios. In particular, we assess Gradient Checkpointing, Low-Rank Adaptation, DeepSpeed's Zero Redundancy Optimizer, and FlashAttention. With a focus on memory and runtime, we examine the impact of different optimization combinations on GPU memory usage and execution runtime during the fine-tuning phase. We provide our recommendation for the best default optimization for balancing memory and runtime across diverse model sizes. We share effective strategies for fine-tuning very large models with tens or hundreds of billions of parameters and for enabling large context lengths during fine-tuning. Furthermore, we propose the appropriate optimization mixtures for fine-tuning under GPU resource limitations.