Packing, initially utilized in the pre-training phase, is an optimization technique designed to maximize hardware resource efficiency by combining different training sequences to fit the model's maximum input length. Although it has demonstrated effectiveness during pre-training, there remains a lack of comprehensive analysis for the supervised fine-tuning (SFT) stage on the following points: (1) whether packing can effectively enhance training efficiency while maintaining performance, (2) the suitable model and dataset sizes for fine-tuning with the packing method, and (3) whether packing unrelated or related training samples might cause the model to either excessively disregard or over-rely on the context. In this paper, we perform extensive comparisons between SFT methods using padding and packing, covering SFT datasets ranging from 69K to 1.2M samples and models from 8B to 70B parameters. This provides the first comprehensive analysis of the advantages and limitations of packing versus padding, as well as practical considerations for implementing packing in various training scenarios. Our analysis covers various benchmarks, including knowledge, reasoning, and coding, as well as GPT-based evaluations, time efficiency, and other fine-tuning parameters. We also open-source our code for fine-tuning and evaluation and provide checkpoints fine-tuned on datasets of different sizes, aiming to advance future research on packing methods. Code is available at: https://github.com/ShuheWang1998/Packing-Analysis?tab=readme-ov-file.
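The core idea of packing, as described above, can be illustrated with a minimal sketch. The function below (a hypothetical helper, not the authors' implementation) greedily concatenates tokenized samples into bins no longer than the model's maximum input length, in contrast to padding, which reserves a full-length slot per sample.

```python
def pack_sequences(sequences, max_length):
    """Greedy first-fit packing of token-id lists into fixed-size bins.

    Each bin holds the concatenated tokens of one or more samples and
    never exceeds max_length; padding would instead use one full-length
    slot per sample.
    """
    bins = []  # each bin is a list of token ids
    for seq in sequences:
        seq = seq[:max_length]  # truncate any over-long sample
        for b in bins:
            if len(b) + len(seq) <= max_length:
                b.extend(seq)  # first bin with enough room
                break
        else:
            bins.append(list(seq))  # no bin fits; open a new one
    return bins

# Toy example: four samples of 300, 500, 200, and 900 tokens.
samples = [[1] * 300, [2] * 500, [3] * 200, [4] * 900]
packed = pack_sequences(samples, max_length=1024)
# Padding would consume 4 * 1024 token slots; packing needs only 2 bins.
print(len(packed))  # 2
```

Real SFT pipelines additionally track per-sample boundaries (e.g. via attention masks or position-id resets) so that loss and attention are not computed across unrelated samples inside one bin; this boundary handling is exactly where the over-reliance and over-disregard effects discussed in the paper can arise.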