In large language model training, input documents are typically concatenated and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, this concatenation approach compromises data integrity: it inevitably breaks many documents into incomplete pieces, and the resulting excessive truncation hinders the model from learning to compose logically coherent and factually consistent content grounded in the complete context. To address this issue, we propose Best-fit Packing, a scalable and efficient method that packs documents into training sequences through length-aware combinatorial optimization. Our method completely eliminates unnecessary truncations while retaining the same training efficiency as concatenation. Empirical results from both text and code pre-training show that our method achieves superior performance (e.g., relative improvements of +4.7% on reading comprehension, +16.8% on context following, and +9.2% on program synthesis) and reduces closed-domain hallucination by up to 58.3%.
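To make the packing step concrete, the following is a minimal sketch of a best-fit-decreasing bin-packing heuristic, which is one plausible reading of "length-aware combinatorial optimization" suggested by the method's name. It is an illustration under stated assumptions, not the authors' implementation: the function name `best_fit_pack`, the representation of documents as token counts, and the rule that only documents longer than the sequence length are split are all assumptions made here.

```python
from bisect import bisect_left, insort


def best_fit_pack(doc_lengths, max_len):
    """Pack documents (given as token counts) into sequences of capacity max_len.

    Hypothetical helper for illustration. Returns a list of sequences, each a
    list of (chunk_length, doc_id) pairs.
    """
    # Step 1 (assumption): split only documents longer than max_len, since
    # such truncation cannot be avoided. Shorter documents stay whole.
    chunks = []
    for doc_id, n in enumerate(doc_lengths):
        while n > max_len:
            chunks.append((max_len, doc_id))
            n -= max_len
        if n > 0:
            chunks.append((n, doc_id))

    # Step 2: best-fit decreasing. Visit chunks from longest to shortest and
    # place each one in the open sequence with the least remaining space
    # that can still hold it.
    chunks.sort(key=lambda c: c[0], reverse=True)
    bins = []   # bins[b] holds the chunks assigned to sequence b
    free = []   # sorted list of (remaining_capacity, bin_index)
    for length, doc_id in chunks:
        # First entry with remaining_capacity >= length is the tightest fit.
        i = bisect_left(free, (length, -1))
        if i < len(free):
            remaining, b = free.pop(i)
        else:
            b = len(bins)
            bins.append([])
            remaining = max_len
        bins[b].append((length, doc_id))
        remaining -= length
        if remaining > 0:
            insort(free, (remaining, b))
    return bins


if __name__ == "__main__":
    # Six documents with these token counts, packed into sequences of 8 tokens.
    for seq in best_fit_pack([5, 3, 4, 2, 9, 1], max_len=8):
        print(seq)
```

In this sketch, any space left unfilled in a sequence would have to be padded. The list-based search also costs linear time per insertion in the worst case, so a corpus-scale implementation would likely need a more efficient structure for locating the tightest-fitting sequence.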