Code provides a general syntactic structure to build complex programs and, when paired with a code interpreter, to perform precise computations; we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for semantic ones (and in particular, those that mix the two). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation of "detect_sarcasm(string)" that the interpreter can execute (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they not only write code, but also selectively "emulate" the interpreter by generating the expected output of "detect_sarcasm(string)". In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode whose undefined behaviors the interpreter can explicitly catch and hand off to an LM to simulate (acting as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. In a nutshell, CoC broadens the scope of reasoning questions that LMs can answer by "thinking in code".
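The interleaving of interpreter execution and LM simulation described above can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's implementation): a program is run line by line with a real Python interpreter, and when a line fails (e.g. it calls an undefined semantic helper such as `detect_sarcasm`), that line is handed to an LM to simulate. The `stub_lm` function stands in for an actual LM call and hard-codes the sarcasm example.

```python
def stub_lm(line: str, state: dict) -> dict:
    """Stand-in for an LM call (the "LMulator"): returns the state updates
    the LM would predict for a line the interpreter could not execute.
    Hypothetical stub; a real system would prompt an LM with the line and state."""
    if "detect_sarcasm" in line:
        # an LM would read state["sentence"] and judge sarcasm semantically
        return {"is_sarcastic": True}
    raise NotImplementedError(f"no simulation available for: {line}")


def run_chain_of_code(program_lines, state=None, lm=stub_lm):
    """Execute a program line by line; on undefined behavior, defer to the LM."""
    state = dict(state or {})
    for line in program_lines:
        try:
            exec(line, {}, state)           # try the real interpreter first
        except Exception:
            state.update(lm(line, state))   # catch and hand off to the LMulator
    return state


program = [
    "count = 0",
    "is_sarcastic = detect_sarcasm(sentence)",  # undefined: LM simulates it
    "if is_sarcastic: count += 1",
]
final = run_chain_of_code(
    program, state={"sentence": "Oh, great. Another meeting."}
)
print(final["count"])  # the arithmetic runs in the interpreter; only the
                       # semantic call was simulated
```

The design point this sketch captures is the division of labor: precise computation (the counter) stays with the interpreter, while only the semantic sub-task is routed to the LM.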