When a human programmer implements a complicated algorithm, they usually start by outlining a rough control flow and then enrich it iteratively, eventually arriving at carefully constructed syntactic structures and variables arranged in a hierarchy. However, state-of-the-art large language models generate code in a single pass, without intermediate warm-up steps that reflect this structured "outline-then-detail" thought process. Inspired by the recent success of chain-of-thought prompting, we propose ChainCoder, a program synthesis language model that generates Python code progressively, i.e., from coarse to fine in multiple passes. We first decompose source code into layout frame components and accessory components via abstract syntax tree parsing to construct a hierarchical representation. We then reformulate the prediction target as a multi-pass objective in which each pass generates a subsequence that is concatenated in the hierarchy. Finally, a tailored transformer architecture is leveraged to jointly encode the natural language descriptions and syntactically aligned I/O data samples. Extensive evaluations show that ChainCoder outperforms state-of-the-art methods, demonstrating that our progressive generation eases the reasoning procedure and guides the language model to generate higher-quality solutions. Our code is available at: https://github.com/VITA-Group/ChainCoder.
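To make the decomposition step concrete, the following is a minimal sketch of how source code might be split into a coarse structural sequence and a fine-grained detail sequence using Python's standard `ast` module. It is not the ChainCoder tokenization: the function name `split_layout_and_accessories` and the specific rule used here (node type names as "layout", identifiers and literals as "accessories") are assumptions made purely for illustration.

```python
import ast


def split_layout_and_accessories(source: str):
    """Split a Python snippet into two sequences (illustrative rule only).

    layout:      AST node type names in depth-first order (coarse structure).
    accessories: identifiers and literal values in the same order (details).
    """
    tree = ast.parse(source)
    layout, accessories = [], []

    def visit(node):
        # Skip Load/Store/Del context markers; they carry no structure here.
        if isinstance(node, ast.expr_context):
            return
        layout.append(type(node).__name__)
        # Record the concrete names and values attached to this node.
        if isinstance(node, ast.FunctionDef):
            accessories.append(node.name)
        elif isinstance(node, ast.arg):
            accessories.append(node.arg)
        elif isinstance(node, ast.Name):
            accessories.append(node.id)
        elif isinstance(node, ast.Constant):
            accessories.append(repr(node.value))
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return layout, accessories


example = "def add(a, b):\n    return a + b\n"
layout, accessories = split_layout_and_accessories(example)
print(layout)
print(accessories)  # ['add', 'a', 'b', 'a', 'b']
```

In this toy split, a coarse-to-fine generator could first predict the `layout` sequence and then fill in the `accessories` in a later pass. How the two kinds of components are actually defined and ordered in ChainCoder is described in the paper and repository, not here.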