Large Language Models (LLMs), such as ChatGPT, have shown impressive performance in code generation. LLMs take prompts as inputs, and Chain-of-Thought (CoT) prompting is the state-of-the-art prompting technique. CoT prompting asks LLMs first to generate CoTs (i.e., intermediate natural language reasoning steps) and then to output the code. However, CoT prompting was designed for natural language generation and has low accuracy in code generation. In this paper, we propose Structured CoTs (SCoTs) and present a novel prompting technique for code generation, named SCoT prompting. Our motivation is that source code contains rich structural information and that any code can be composed of three program structures (i.e., sequence, branch, and loop structures). Intuitively, structured intermediate reasoning steps lead to structured source code. Thus, we ask LLMs to use program structures to build CoTs, obtaining SCoTs. Then, LLMs generate the final code based on the SCoTs. Compared to CoT prompting, SCoT prompting explicitly constrains LLMs to think about how to solve requirements from the view of source code, which further improves the performance of LLMs in code generation. We apply SCoT prompting to two LLMs (i.e., ChatGPT and Codex) and evaluate it on three benchmarks (i.e., HumanEval, MBPP, and MBCPP). The results show that (1) SCoT prompting outperforms the state-of-the-art baseline, CoT prompting, by up to 13.79% in Pass@1; (2) in a human evaluation, developers prefer programs from SCoT prompting; and (3) SCoT prompting is robust to examples and achieves substantial improvements.
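To make the two-step idea concrete, the following is a minimal sketch of what an SCoT and the resulting code might look like. The requirement, the SCoT layout, the prompt wording, and the names `REQUIREMENT`, `SCOT`, `build_code_prompt`, and `sum_even` are all assumptions made for illustration; they are not taken from the paper's prompt templates or benchmarks, and no real LLM is called.

```python
from typing import List

# Hypothetical requirement, chosen only for illustration.
REQUIREMENT = "Return the sum of the even numbers in a list of integers."

# Step 1 (illustrative): a structured CoT that expresses the reasoning with
# the three program structures named in the abstract. The exact layout is
# an assumption, not the paper's format.
SCOT = """\
Input: nums: list of int
Output: total: int
1: total <- 0                  (sequence)
2: for each n in nums:         (loop)
3:     if n is even:           (branch)
4:         total <- total + n
5: return total
"""


def build_code_prompt(requirement: str, scot: str) -> str:
    """Assemble a step-2 prompt asking a model to turn the SCoT into code.

    The wording here is hypothetical; it only shows that the SCoT is passed
    along with the requirement when the final code is requested.
    """
    return (
        f"Requirement:\n{requirement}\n\n"
        f"Structured reasoning steps:\n{scot}\n"
        "Write a Python function that implements these steps."
    )


def sum_even(nums: List[int]) -> int:
    """Code a model might plausibly produce from the SCoT above."""
    total = 0                  # step 1: sequence
    for n in nums:             # step 2: loop
        if n % 2 == 0:         # step 3: branch
            total += n         # step 4
    return total               # step 5


if __name__ == "__main__":
    print(build_code_prompt(REQUIREMENT, SCOT))
    print(sum_even([1, 2, 3, 4]))
```

In this sketch the loop and branch in the SCoT map one-to-one onto the `for` and `if` statements of the generated function, which is the intuition behind asking the model to reason in program structures before writing code.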