This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to $100$B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of $3$. For a $3$B parameter model, it improves the post-training performance by over $10\%$ on several challenging reasoning benchmarks.
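The core data-construction step described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's exact recipe: the prompt wording, the `<think>` delimiters, and the `generate_thinking` stub (which stands in for a real LLM call) are all assumptions.

```python
# Hypothetical sketch of Thinking augmented Pre-Training (TPT) data
# construction: each raw document is paired with an automatically
# generated thinking trajectory, and the concatenation becomes the
# training sample. generate_thinking is a stub standing in for an
# actual LLM call; the delimiters below are assumptions.

def generate_thinking(document: str) -> str:
    """Stand-in for an LLM that produces a step-by-step rationale
    unpacking the document's content."""
    return f"Let me reason step by step about this text: {document[:40]}"

def augment(document: str) -> str:
    """Append the generated trajectory to the original text, wrapped in
    assumed <think> delimiters, so that hard-to-learn tokens are
    accompanied by an explicit reasoning decomposition."""
    thinking = generate_thinking(document)
    return f"{document}\n<think>\n{thinking}\n</think>"

# The augmented corpus is longer than the original, which is how the
# method effectively increases the volume of training data.
corpus = ["The derivative of x^2 is 2x."]
augmented_corpus = [augment(doc) for doc in corpus]
```

In this sketch the augmented sample still contains the original document verbatim, so standard next-token pre-training can be applied to the concatenation without any change to the training objective.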