The rise of large language models (LLMs) has created a significant disparity: industrial research labs, with their computational resources, expert teams, and advanced infrastructure, can effectively fine-tune LLMs, while individual developers and small organizations face barriers due to limited resources. In this paper, we aim to bridge this gap by presenting a comprehensive study on supervised fine-tuning of LLMs using instruction-tuning datasets spanning diverse knowledge domains and skills. We focus on small LLMs (3B to 7B parameters) for their cost-efficiency and accessibility. We explore various training configurations and strategies across four open-source pre-trained models. We provide detailed documentation of these configurations, revealing findings that challenge several common training practices, including hyperparameter recommendations from TULU and the phased training recommended by Orca. Key insights from our work include: (i) larger batch sizes paired with lower learning rates lead to improved model performance on benchmarks such as MMLU, MTBench, and the Open LLM Leaderboard; (ii) early-stage training dynamics, such as lower gradient norms and higher loss values, are strong indicators of better final model performance, enabling early termination of sub-optimal runs and significant computational savings; (iii) through a thorough exploration of hyperparameters like warmup steps and learning rate schedules, we provide guidance for practitioners and find that certain simplifications do not compromise performance; and (iv) we observe no significant difference in performance between phased and stacked training strategies, but stacked training is simpler and more sample-efficient. As these findings hold robustly across datasets and models, we hope this study serves as a guide for practitioners fine-tuning small LLMs and promotes a more inclusive environment for LLM research.
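To make findings (i) and (iii) concrete, the following is a minimal sketch of what such a fine-tuning configuration might look like, expressed with the Hugging Face transformers TrainingArguments API. All numeric values (batch size, learning rate, warmup steps, schedule, epochs) are illustrative assumptions for exposition, not the settings recommended in this work.

```python
# A hedged sketch of a supervised fine-tuning configuration reflecting the
# reported trends: a larger effective batch size paired with a lower learning
# rate, a short warmup, and a simple schedule. Every concrete value below is
# an assumption chosen for illustration, not a setting prescribed by the paper.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="sft-small-llm",        # hypothetical output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,    # effective batch of 8 * 16 = 128
                                       # sequences per device (x num GPUs)
    learning_rate=1e-5,                # lower LR paired with the larger batch,
                                       # per finding (i)
    warmup_steps=100,                  # short warmup; finding (iii) suggests
                                       # such simplifications need not hurt
    lr_scheduler_type="cosine",        # one common schedule choice
    num_train_epochs=2,
    logging_steps=10,                  # log loss and gradient norm early;
                                       # finding (ii) uses these signals to
                                       # terminate sub-optimal runs
    bf16=True,                         # mixed precision, practical for 3B-7B
)
```

This object would then be passed to a standard Trainer alongside a model and an instruction-tuning dataset; the point of the sketch is only to show where each finding surfaces as a configuration knob.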