CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules

Large Language Models (LLMs) are already proficient at solving simpler programming tasks such as those in the HumanEval and MBPP benchmarks. However, more complex and competitive programming tasks remain challenging for these models, possibly because they tend to generate solutions as monolithic code blocks instead of decomposing them into logical sub-tasks and sub-modules. Experienced programmers, by contrast, instinctively write modularized code with abstraction when solving complex tasks, often reusing previously developed modules. To address this gap, we propose CodeChain, a novel inference framework that elicits modularized code generation through a chain of self-revisions, each guided by representative sub-modules generated in previous iterations. Concretely, CodeChain first instructs the LLM to generate modularized code through chain-of-thought prompting. It then applies a chain of self-revisions by iterating two steps: 1) extracting and clustering the generated sub-modules and selecting the cluster representatives as the more generic and reusable implementations, and 2) augmenting the original chain-of-thought prompt with these selected module implementations and instructing the LLM to regenerate new modularized solutions. We find that by naturally encouraging the LLM to reuse previously developed and verified sub-modules, CodeChain significantly boosts both the modularity and the correctness of the generated solutions, achieving relative pass@1 improvements of 35% on APPS and 76% on CodeContests. It is effective on both OpenAI LLMs and open-source LLMs such as WizardCoder. We also conduct comprehensive ablation studies over prompting methods, numbers of clusters, model sizes, program quality, and other factors, providing useful insights into what underpins CodeChain's success.
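The two-step revision loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: `generate` is a hypothetical callable standing in for sampling programs from an LLM, token-overlap (Jaccard) similarity with a greedy threshold stands in for whatever embedding and clustering method is actually used, the wording appended to the prompt is invented, and the filtering that would make sub-modules "verified" is omitted.

```python
import ast
import re
from typing import Callable


def extract_functions(program: str) -> list[str]:
    """Return the source of each top-level function; [] if the code does not parse."""
    try:
        tree = ast.parse(program)
    except SyntaxError:
        return []
    return [
        ast.get_source_segment(program, node)
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    ]


def token_set(src: str) -> set[str]:
    """Crude stand-in for a code embedding: the set of identifiers and keywords."""
    return set(re.findall(r"[A-Za-z_]\w*", src))


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_modules(modules: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy clustering (an assumption): join the first cluster whose seed is similar enough."""
    clusters: list[list[str]] = []
    for module in modules:
        for members in clusters:
            if jaccard(token_set(module), token_set(members[0])) >= threshold:
                members.append(module)
                break
        else:
            clusters.append([module])
    return clusters


def pick_representative(members: list[str]) -> str:
    """Pick the member most similar, in total, to the rest of its cluster."""
    return max(
        members,
        key=lambda m: sum(jaccard(token_set(m), token_set(o)) for o in members),
    )


def build_revision_prompt(base_prompt: str, representatives: list[str]) -> str:
    """Augment the original prompt with the selected sub-modules (wording is hypothetical)."""
    return (
        base_prompt
        + "\n\nRelevant functions you may reuse or adapt:\n\n"
        + "\n\n".join(representatives)
    )


def codechain_sketch(
    base_prompt: str,
    generate: Callable[[str], list[str]],  # hypothetical LLM sampler: prompt -> programs
    rounds: int = 2,
) -> list[str]:
    """Run an initial generation followed by `rounds` self-revision rounds."""
    programs = generate(base_prompt)
    for _ in range(rounds):
        # Step 1: extract and cluster sub-modules, then select cluster representatives.
        modules = [f for p in programs for f in extract_functions(p)]
        representatives = [pick_representative(c) for c in cluster_modules(modules)]
        # Step 2: augment the original prompt and regenerate.
        programs = generate(build_revision_prompt(base_prompt, representatives))
    return programs
```

In use, `generate` would wrap a model call that returns several candidate programs per prompt. Each round rebuilds the prompt from the original chain-of-thought prompt rather than from the previous round's prompt, so only the most recent representatives are carried forward; that is a design choice of this sketch, not something the abstract specifies.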
