Getting language models to reason correctly about code requires training data in which each reasoning step can be checked. Current synthetic Chain-of-Thought (CoT) training data often consists of plausible-sounding explanations generated by teacher models rather than verifiable accounts of actual program behavior. Models trained on such data learn reasoning that looks syntactically correct but is logically flawed. To address this, we build a pipeline that generates execution-trace-verified CoT rationales: it instruments code to capture execution traces, narrates those traces in natural language, and cross-checks each narration against the original trace. We systematically create 54,000 verified, bi-directional rationales that teach models to reason both forward (input$\rightarrow$output) and backward (output$\rightarrow$input). Models fine-tuned on our verified data improve substantially, with peak gains across our fine-tuned models of +26.6 on LiveCodeBench-Exec, +22.2 on CruxEval, and +19.5 on HumanEval, demonstrating that verification quality directly determines both reasoning and code generation capabilities. The complete synthesis pipeline is available as open source: https://github.com/IBM/verified-code-cot/
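The capture, narrate, and cross-check loop can be illustrated with a minimal Python sketch. This is not the released pipeline: the helper names (`capture_trace`, `narrate`, `verify`) and the example function `running_sum` are hypothetical, and the template-based narration is an assumed stand-in for whatever narrator the actual pipeline uses. Only the standard-library `sys.settrace` hook is relied on.

```python
import sys


def capture_trace(func, *args):
    """Run func(*args), recording a locals snapshot before each executed line."""
    events = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only trace the target function's own frame.
        if frame.f_code is not code:
            return None
        if event == "line":
            # Shallow copy; mutable values would alias later states.
            offset = frame.f_lineno - code.co_firstlineno
            events.append((offset, dict(frame.f_locals)))
        elif event == "return":
            events.append(("return", arg))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)
    return result, events


def narrate(events):
    """Emit (claim, sentence) pairs: one per variable change, plus the return."""
    steps, prev = [], {}
    for idx, (where, state) in enumerate(events):
        if where == "return":
            steps.append(((idx, "return", state), f"The function returns {state!r}."))
            continue
        for name, value in state.items():
            if name not in prev or prev[name] != value:
                sentence = f"At step {idx}, {name} becomes {value!r}."
                steps.append(((idx, name, value), sentence))
        prev = state
    return steps


def verify(steps, events):
    """Accept the narration only if every claim matches the recorded trace."""
    for (idx, name, value), _sentence in steps:
        where, state = events[idx]
        if name == "return":
            if where != "return" or state != value:
                return False
        elif where == "return" or name not in state or state[name] != value:
            return False
    return True


def running_sum(xs):  # hypothetical function under trace
    total = 0
    for x in xs:
        total += x
    return total


result, events = capture_trace(running_sum, [1, 2, 3])
steps = narrate(events)
print(verify(steps, events))
```

In this sketch a narration survives only if each of its claims is backed by a recorded program state, so a rationale that misstates an intermediate value is rejected instead of entering the training set.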