The rise of large language models (LLMs) has created a significant disparity: industrial research labs, with their computational resources, expert teams, and advanced infrastructure, can effectively fine-tune LLMs, while individual developers and small organizations face barriers due to limited resources. In this paper, we aim to bridge this gap by presenting a comprehensive study on supervised fine-tuning of LLMs using instruction-tuning datasets spanning diverse knowledge domains and skills. We focus on small LLMs (3B to 7B parameters) for their cost-efficiency and accessibility. We explore various training configurations and strategies across four open-source pre-trained models. We provide detailed documentation of these configurations, revealing findings that challenge several common training practices, including hyperparameter recommendations from TULU and the phased training recommended by Orca. Key insights from our work include: (i) larger batch sizes paired with lower learning rates lead to improved model performance on benchmarks such as MMLU, MTBench, and the Open LLM Leaderboard; (ii) early-stage training dynamics, such as lower gradient norms and higher loss values, are strong indicators of better final model performance, enabling early termination of sub-optimal runs and significant computational savings; (iii) through a thorough exploration of hyperparameters like warmup steps and learning rate schedules, we provide guidance for practitioners and find that certain simplifications do not compromise performance; and (iv) we observe no significant difference in performance between phased and stacked training strategies, but stacked training is simpler and more sample-efficient. As these findings hold robustly across datasets and models, we hope this study serves as a guide for practitioners fine-tuning small LLMs and promotes a more inclusive environment for LLM research.
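To make findings (i) and (iii) concrete, the following is a minimal sketch of what such a fine-tuning configuration might look like, expressed with the Hugging Face transformers TrainingArguments API. All numeric values (batch size, learning rate, warmup steps, schedule, epochs) are illustrative assumptions for exposition, not the settings recommended in this work.

```python
# A hedged sketch of a supervised fine-tuning configuration reflecting the
# reported trends: a larger effective batch size paired with a lower learning
# rate, a short warmup, and a simple schedule. Every concrete value below is
# an assumption chosen for illustration, not a setting prescribed by the paper.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="sft-small-llm",        # hypothetical output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,    # effective batch of 8 * 16 = 128
                                       # sequences per device (x num GPUs)
    learning_rate=1e-5,                # lower LR paired with the larger batch,
                                       # per finding (i)
    warmup_steps=100,                  # short warmup; finding (iii) suggests
                                       # such simplifications need not hurt
    lr_scheduler_type="cosine",        # one common schedule choice
    num_train_epochs=2,
    logging_steps=10,                  # log loss and gradient norm early;
                                       # finding (ii) uses these signals to
                                       # terminate sub-optimal runs
    bf16=True,                         # mixed precision, practical for 3B-7B
)
```

This object would then be passed to a standard Trainer alongside a model and an instruction-tuning dataset; the point of the sketch is only to show where each finding surfaces as a configuration knob.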