Code provides a general syntactic structure to build complex programs and, when paired with a code interpreter, to perform precise computations. We hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for semantic ones (and in particular, tasks that mix the two). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation of "detect_sarcasm(string)" that the interpreter can execute (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they not only write code, but also selectively "emulate" the interpreter by generating the expected output of "detect_sarcasm(string)". In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode whose undefined behaviors the interpreter can explicitly catch and hand off to an LM for simulation (acting as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. In a nutshell, CoC broadens the scope of reasoning questions that LMs can answer by "thinking in code".
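To make the interleaved execution concrete, the following is a minimal sketch of the idea (an illustration under our own assumptions, not the paper's implementation): each program line is first run by the Python interpreter, and when a line raises a NameError on an undefined semantic function such as detect_sarcasm, evaluation of that expression is handed off to a hypothetical LM stub, lm_simulate.

```python
# Toy sketch of the CoC interpreter/LMulator hand-off (hypothetical helper
# names; a real system would call an LM instead of the stub below).

def lm_simulate(expression: str, state: dict) -> object:
    """Stand-in for the 'LMulator': an LM would predict this expression's value."""
    # Hypothetical canned answer for the sarcasm example.
    if expression.startswith("detect_sarcasm"):
        return True
    raise NotImplementedError(expression)

def run_chain_of_code(lines: list[str]) -> dict:
    state: dict = {}
    for line in lines:
        try:
            exec(line, {}, state)  # try the real interpreter first
        except NameError:
            # Undefined behavior: hand the right-hand side to the LM to simulate.
            target, _, expr = line.partition("=")
            state[target.strip()] = lm_simulate(expr.strip(), state)
    return state

program = [
    "count = 0",
    "is_sarcastic = detect_sarcasm('Oh great, another meeting.')",
    "count += 1 if is_sarcastic else 0",
]
final_state = run_chain_of_code(program)  # count is incremented once
```

The interpreter keeps the precise arithmetic (the counter update) exact, while only the semantic call falls back to the LM, mirroring the division of labor described above.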