This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to $100$B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of $3$. For a $3$B parameter model, it improves the post-training performance by over $10\%$ on several challenging reasoning benchmarks.
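The core data-construction step described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's exact recipe: the prompt wording, the `<think>` delimiters, and the `generate_thinking` stub (which stands in for a real LLM call) are all assumptions.

```python
# Hypothetical sketch of Thinking augmented Pre-Training (TPT) data
# construction: each raw document is paired with an automatically
# generated thinking trajectory, and the concatenation becomes the
# training sample. generate_thinking is a stub standing in for an
# actual LLM call; the delimiters below are assumptions.

def generate_thinking(document: str) -> str:
    """Stand-in for an LLM that produces a step-by-step rationale
    unpacking the document's content."""
    return f"Let me reason step by step about this text: {document[:40]}"

def augment(document: str) -> str:
    """Append the generated trajectory to the original text, wrapped in
    assumed <think> delimiters, so that hard-to-learn tokens are
    accompanied by an explicit reasoning decomposition."""
    thinking = generate_thinking(document)
    return f"{document}\n<think>\n{thinking}\n</think>"

# The augmented corpus is longer than the original, which is how the
# method effectively increases the volume of training data.
corpus = ["The derivative of x^2 is 2x."]
augmented_corpus = [augment(doc) for doc in corpus]
```

In this sketch the augmented sample still contains the original document verbatim, so standard next-token pre-training can be applied to the concatenation without any change to the training objective.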