Building robust and general reasoning ability is a central goal in the development of large language models (LLMs). Recent efforts increasingly turn to code as a rich training source, given its inherent logical structure and diverse reasoning paradigms such as divide-and-conquer, topological ordering, and enumeration. However, reasoning in code is often expressed implicitly and entangled with syntactic or implementation noise, making direct training on raw code suboptimal. To address this, we introduce TracePile, a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step chain-of-thought-style rationales, which we call Chain of Execution (CoE). The corpus spans domains including mathematics, classical algorithms, and algorithmic competitions, and is enriched with variable-tracing questions and code rewrites to enhance logical granularity and code diversity. We evaluate TracePile under three training setups: continual pretraining, instruction tuning after pretraining, and two-stage fine-tuning. Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements. Notably, TracePile boosts LLaMA 3.1-8B by 7.1\% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and MMLU under two-stage fine-tuning.
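To make the Chain-of-Execution idea concrete, the following is a minimal illustrative sketch, not the paper's actual data pipeline: it uses Python's standard `sys.settrace` hook to record local-variable states line by line while a small function runs, which is the kind of variable-tracing signal a step-by-step execution rationale could be built from. The helper names `trace_execution` and `gcd` are hypothetical choices for this example.

\begin{verbatim}
import sys

def trace_execution(func, *args):
    """Run func(*args) while recording the local-variable state at each
    line event inside func (state as of just before the line executes)."""
    steps = []

    def tracer(frame, event, arg):
        # Record only line events belonging to the traced function itself.
        if event == "line" and frame.f_code is func.__code__:
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always restore untraced execution
    return result, steps

def gcd(a, b):
    # Euclid's algorithm: the loop body updates variables step by step.
    while b:
        a, b = b, a % b
    return a

if __name__ == "__main__":
    result, steps = trace_execution(gcd, 48, 18)
    for lineno, local_vars in steps:
        print(f"step @ line {lineno}: {local_vars}")
    print("final result:", result)
\end{verbatim}

Running this prints the evolving `(a, b)` pairs (48, 18), (18, 12), (12, 6), (6, 0) before returning 6; serializing such traces into natural-language steps is one plausible way an execution could become an explicit rationale.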