Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

Code provides a general syntactic structure for building complex programs and, when paired with a code interpreter, for performing precise computations. We hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for semantic ones (and in particular, those that mix both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation of "detect_sarcasm(string)" that the interpreter can execute (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they not only write code, but also selectively "emulate" the interpreter by generating the expected output of "detect_sarcasm(string)" and of other lines of code that cannot be executed. In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode, so that the interpreter can explicitly catch undefined behaviors and hand them off to an LM to simulate (as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. CoC scales well with large and small models alike, and broadens the scope of reasoning questions that LMs can correctly answer by "thinking in code". Project webpage: https://chain-of-code.github.io.
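The handoff described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: it simplifies the idea to catching a NameError for an undefined function, binding that name to a stand-in, and re-running the program. The names `fake_lm`, `run_with_lmulator`, and `PROGRAM` are hypothetical, and `fake_lm` is a keyword heuristic that stands in for a real LM query so that the example runs offline.

```python
import re
from typing import Any, Callable, Dict


def fake_lm(func_name: str, args: tuple) -> Any:
    """Hypothetical stand-in for an LM query (assumption, not a real model)."""
    if func_name == "detect_sarcasm":
        # Crude keyword heuristic, used only so the sketch runs offline.
        text = args[0].lower()
        return "oh great" in text or "just love" in text
    raise NotImplementedError(func_name)


def run_with_lmulator(
    program: str,
    lm: Callable[[str, tuple], Any],
    max_retries: int = 10,
) -> Dict[str, Any]:
    """Run `program`; delegate calls to undefined functions to `lm`.

    Simplification: on each NameError the missing name is bound to a stub
    and the whole program is re-run from the start.
    """
    stubs: Dict[str, Any] = {}
    for _ in range(max_retries):
        env: Dict[str, Any] = dict(stubs)
        try:
            exec(program, env)  # the interpreter handles everything it can
            return env
        except NameError as err:
            match = re.search(r"name '(\w+)' is not defined", str(err))
            if match is None:
                raise
            name = match.group(1)
            # Hand the undefined function off to the LM stand-in.
            stubs[name] = lambda *args, _n=name: lm(_n, args)
    raise RuntimeError("too many undefined names")


PROGRAM = '''
sentences = [
    "Oh great, another Monday.",
    "The report is due Friday.",
    "I just love waiting in line.",
]
count = 0
for s in sentences:
    if detect_sarcasm(s):  # semantic sub-task, not defined anywhere
        count += 1
'''

env = run_with_lmulator(PROGRAM, fake_lm)
print(env["count"])
```

In this sketch the loop and the counter are executed by the Python interpreter, while only the `detect_sarcasm` call is delegated. Re-running the program from the start is a shortcut that is safe here because the program has no side effects; a line-by-line scheme that carries program state across the handoff would avoid the re-execution.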
