Pretraining directly on web-scale corpora is the de facto paradigm for building language models. We study an alternative setting in which the model is initially exposed to abstract structured data as a means to ease the subsequent acquisition of rich semantic knowledge, much like humans learn simple logic and mathematics before higher reasoning. As such abstract data, we focus specifically on procedural data generated by formal languages and other simple algorithms. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly: for example, on context recall (Needle-in-a-Haystack), accuracy jumps from 10% to 98% when pretraining on Dyck sequences (balanced brackets). Second, we study how these gains carry over to pretraining larger models (up to 1.3B parameters). We find that front-loading as little as 0.1% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (the C4, CodeParrot, and DeepMind-Math datasets). Notably, this procedural pretraining enables the models to reach the same loss value with only 55%, 67%, and 86% of the original data, respectively. Third, we explore the underlying mechanisms and find that procedural pretraining instils non-trivial structure in both attention and MLP layers; the former is particularly important for structured domains (e.g. code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means of improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.
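The abstract's example of procedural data is Dyck sequences (balanced brackets). As a minimal sketch of how such data might be generated, the sampler below builds a balanced sequence by choosing uniformly at each step among the legal moves (open a new bracket if any remain, close one if the prefix allows). The single bracket type and the uniform move choice are illustrative assumptions, not the paper's exact generator.

```python
import random

def dyck_sequence(n_pairs: int, rng=random) -> str:
    """Sample a balanced bracket (Dyck) sequence with n_pairs bracket pairs.

    At each step, choose uniformly between opening a new bracket and
    closing a currently open one, restricted to the legal moves.
    (Illustrative sampler; not the paper's exact data generator.)
    """
    seq = []
    opened = 0  # brackets opened so far
    closed = 0  # brackets closed so far
    while closed < n_pairs:
        can_open = opened < n_pairs          # pairs still left to open
        can_close = opened > closed          # some bracket is still open
        if can_open and (not can_close or rng.random() < 0.5):
            seq.append("(")
            opened += 1
        else:
            seq.append(")")
            closed += 1
    return "".join(seq)

def is_balanced(s: str) -> bool:
    """Check the Dyck property: depth never negative, ends at zero."""
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return depth == 0
```

A corpus of such sequences can then be tokenized and front-loaded before (or mixed into) the natural-language pretraining stream.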