Code provides a general syntactic structure for building complex programs and, when paired with a code interpreter, for performing precise computations. We hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for linguistic ones (and in particular, those that are a mix of both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation of "detect_sarcasm(string)" that the interpreter can execute (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they are used not only to write the code, but also to selectively "emulate" the interpreter by generating the expected output of "detect_sarcasm(string)" and of other lines of code (e.g., those that the interpreter could not execute). In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format linguistic sub-tasks in a program as flexible pseudocode, so that the interpreter can explicitly catch undefined behaviors and hand them off to an LM to simulate (as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. CoC scales well with large and small models alike, and broadens the scope of reasoning questions that LMs can correctly answer by "thinking in code". Project webpage: https://chain-of-code.github.io/.
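The interpreter-plus-LMulator hand-off described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names run_chain_of_code, stub_lmulator, and CANNED_JUDGMENTS are hypothetical, the program is restricted to one statement per line, and the "LM" is replaced by a lookup table of canned judgments so that the example runs offline. A real system would presumably prompt an LM with the failing line and the current program state, then parse the predicted state update.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical canned answers standing in for an LM's sarcasm judgments.
CANNED_JUDGMENTS = {
    "Oh great, another Monday.": True,
    "The meeting starts at nine.": False,
    "Wow, I just love traffic.": True,
}


def stub_lmulator(line: str, state: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the LM: return the state after `line`.

    Assumption: a real LMulator would predict this update by prompting an LM;
    here the line is re-run with a lookup-table version of detect_sarcasm.
    """
    scratch = dict(state)
    exec(line, {"detect_sarcasm": lambda s: CANNED_JUDGMENTS[s]}, scratch)
    return scratch


def run_chain_of_code(
    program: str,
    lmulator: Callable[[str, Dict[str, Any]], Dict[str, Any]],
) -> Tuple[Dict[str, Any], List[str]]:
    """Execute `program` line by line, handing failing lines to `lmulator`."""
    state: Dict[str, Any] = {}
    handed_off: List[str] = []
    for line in program.splitlines():
        if not line.strip():
            continue
        try:
            # First try the real Python interpreter.
            exec(line, {}, state)
        except Exception:
            # The interpreter could not run the line (e.g. NameError for an
            # undefined function), so the simulated LM updates the state.
            handed_off.append(line)
            state = lmulator(line, state)
    return state, handed_off


PROGRAM = """
sentences = ["Oh great, another Monday.", "The meeting starts at nine.", "Wow, I just love traffic."]
count = 0
count += detect_sarcasm(sentences[0])
count += detect_sarcasm(sentences[1])
count += detect_sarcasm(sentences[2])
"""

state, handed_off = run_chain_of_code(PROGRAM, stub_lmulator)
print(state["count"])   # 2
print(len(handed_off))  # 3
```

In this sketch the two assignment lines run in the ordinary interpreter, while the three lines that call the undefined detect_sarcasm raise NameError and are routed to the stub, which keeps the program state consistent across both execution paths.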