Intermediate reasoning or acting steps have successfully improved large language models (LLMs) on various downstream natural language processing (NLP) tasks. When applying LLMs to code generation, recent works mainly focus on directing the models to articulate intermediate natural-language reasoning steps, as in chain-of-thought (CoT) prompting, and then to output code conditioned on the natural language or other structured intermediate steps. However, such output is not well suited to code translation or generation tasks, since standard CoT differs from code in both logical structure and form of expression. In this work, we introduce the universal code (UniCode) as the intermediate representation: a description of algorithm steps using a mix of programming-language conventions, such as the assignment operator, conditional operators, and loops. To this end, we collect an instruction dataset, UniCoder-Instruct, to train our model UniCoder with multi-task learning objectives. UniCoder-Instruct comprises natural-language questions, code solutions, and the corresponding universal code. The alignment between the intermediate universal-code representation and the final code solution significantly improves the quality of the generated code. The experimental results demonstrate that UniCoder with the universal code outperforms previous prompting methods by a large margin, showcasing the effectiveness of the structural clues in pseudo-code.
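To make the idea concrete, here is a small illustrative sketch (a hypothetical example, not drawn from UniCoder-Instruct) of how a universal-code intermediate representation, written with assignment, conditional, and loop conventions, might align step by step with a final Python solution:

```python
# Hypothetical universal-code intermediate representation (shown as comments):
# it mixes programming conventions rather than free-form natural language.
#
#   total <- 0
#   for x in numbers:
#       if x mod 2 == 0:
#           total <- total + x
#   return total

def sum_of_evens(numbers):
    """Final code solution, aligned line by line with the universal code above."""
    total = 0
    for x in numbers:
        if x % 2 == 0:
            total += x
    return total

print(sum_of_evens([1, 2, 3, 4, 5, 6]))  # -> 12
```

Because each universal-code step maps onto exactly one construct in the target language, the intermediate representation can guide generation more directly than a natural-language CoT would.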