Despite recent progress made by large language models in code generation, they still struggle with programs that meet complex requirements. Recent work utilizes plan-and-solve decomposition to decrease the complexity, and leverages self-tests to refine the generated program. Yet, planning for deeply nested requirements in advance can be challenging, and the self-tests must be accurate to achieve self-improvement. To this end, we propose FunCoder, a code generation framework incorporating the divide-and-conquer strategy with functional consensus. Specifically, FunCoder recursively branches off sub-functions as smaller goals during code generation, represented by a tree hierarchy. These sub-functions are then composed to attain more complex objectives. In addition, we designate functions via a consensus formed by identifying similarities in program behavior, mitigating error propagation. FunCoder outperforms state-of-the-art methods by +9.8% on average on HumanEval, MBPP, xCodeEval and MATH with GPT-3.5 and GPT-4. Moreover, our method demonstrates superiority on smaller models: with FunCoder, StableCode-3b surpasses GPT-3.5 by +18.6% and achieves 97.7% of GPT-4's performance on HumanEval. Further analysis reveals that our proposed dynamic function decomposition is capable of handling complex requirements, and that functional consensus prevails over self-testing in correctness evaluation.
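The consensus idea described above can be illustrated with a minimal sketch: sample several candidate implementations of a sub-function, run all of them on a shared set of inputs, group candidates whose observed behavior matches exactly, and keep a member of the largest behavior cluster. The function name, grouping-by-`repr` scheme, and error handling below are illustrative assumptions, not FunCoder's actual implementation.

```python
from collections import defaultdict

def functional_consensus(candidates, test_inputs):
    """Return the candidate whose behavior agrees with the most others.

    `candidates`: list of callables implementing the same sub-function.
    `test_inputs`: shared list of argument tuples used to probe behavior.
    Hypothetical sketch of behavior-based consensus, not the paper's code.
    """
    groups = defaultdict(list)  # behavior signature -> candidates
    for fn in candidates:
        outputs = []
        for args in test_inputs:
            try:
                outputs.append(repr(fn(*args)))
            except Exception as e:  # an exception is also observable behavior
                outputs.append(f"error:{type(e).__name__}")
        groups[tuple(outputs)].append(fn)
    # The largest cluster of behaviorally identical candidates wins;
    # return one representative from it.
    return max(groups.values(), key=len)[0]
```

For example, given three candidate implementations of absolute value where one is buggy, the two agreeing candidates form the majority cluster and one of them is selected, without any reference tests being needed.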