This work introduces (1) a technique that allows large language models (LLMs) to leverage user-provided code when solving programming tasks and (2) a method that iteratively generates modular sub-functions to aid future code generation attempts when the code initially generated by the LLM is inadequate. Generating computer programs in general-purpose programming languages like Python poses a challenge for LLMs when they are instructed to use code provided in the prompt. Code-specific LLMs (e.g., GitHub Copilot, CodeLlama2) can generate code completions in real time by drawing on all code available in a development environment. However, restricting code-specific LLMs to in-context code is not straightforward: the model is not explicitly instructed to use the user-provided code, and users cannot indicate precisely which snippets the model should incorporate into its context. Moreover, current systems lack effective recovery methods, forcing users to re-prompt the model with modified prompts until it produces a sufficient solution. Our method differs from traditional LLM-powered code generation in two ways: it constrains code generation to an explicit function set, and it enables recovery from failed attempts. When the LLM cannot produce working code, we automatically generate modular sub-functions that aid subsequent attempts at generating functional code. A by-product of our method is a library of reusable sub-functions that can solve related tasks, imitating a software team whose efficiency scales with experience. We also introduce a new "half-shot" evaluation paradigm that provides tighter estimates of LLMs' coding abilities than traditional zero-shot evaluation. The proposed evaluation method encourages models to output solutions in a structured format, reducing the number of syntax errors that can be mistaken for poor coding ability.
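To make the constrain-then-recover idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the explicit function set is a dictionary mapping function names to source code, that the constraint is checked by inspecting call sites with Python's `ast` module, and that the LLM is replaced by two hypothetical stubs (`fake_generate`, `fake_propose_helper`). All names here (`solve`, `respects_function_set`, `passes`, `total`, `square`) are illustrative assumptions.

```python
import ast
from typing import Callable, Dict, Optional, Set, Tuple


def called_names(source: str) -> Set[str]:
    """Collect names of plain calls such as `f(x)`; method calls are ignored."""
    return {
        node.func.id
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }


def respects_function_set(source: str, allowed: Set[str]) -> bool:
    """True if `source` parses and every plain call targets an allowed name."""
    try:
        return called_names(source) <= allowed
    except SyntaxError:
        return False


def passes(source: str, library: Dict[str, str], entry: str,
           check: Callable[[Callable], bool]) -> bool:
    """Run the library and the candidate in a fresh namespace, then apply `check`.

    Note: calling exec on model output is unsafe outside a sandbox.
    """
    namespace: dict = {}
    try:
        for helper_source in library.values():
            exec(helper_source, namespace)
        exec(source, namespace)
        return bool(check(namespace[entry]))
    except Exception:
        return False


def solve(task: str, entry: str, check: Callable[[Callable], bool],
          library: Dict[str, str],
          generate: Callable[[str, Dict[str, str]], str],
          propose_helper: Callable[[str, Dict[str, str]], Tuple[str, str]],
          max_rounds: int = 3) -> Optional[str]:
    """Try to solve `task`; after each failure, grow the library by one helper."""
    for _ in range(max_rounds):
        candidate = generate(task, library)
        if (respects_function_set(candidate, set(library))
                and passes(candidate, library, entry, check)):
            return candidate
        name, helper_source = propose_helper(task, library)
        library[name] = helper_source  # the helper stays available for later tasks
    return None


# Hypothetical stand-ins for LLM calls, used only to make the sketch runnable.
def fake_generate(task: str, library: Dict[str, str]) -> str:
    if "square" in library:
        return "def sum_of_squares(xs):\n    return total([square(x) for x in xs])\n"
    # First attempt calls `sum`, which is outside the explicit function set.
    return "def sum_of_squares(xs):\n    return sum(x * x for x in xs)\n"


def fake_propose_helper(task: str, library: Dict[str, str]) -> Tuple[str, str]:
    return "square", "def square(x):\n    return x * x\n"


# User-provided code: the only function the model may call at the start.
library = {
    "total": "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc\n"
}
result = solve(
    task="Return the sum of squares of a list of numbers.",
    entry="sum_of_squares",
    check=lambda f: f([1, 2, 3]) == 14,
    library=library,
    generate=fake_generate,
    propose_helper=fake_propose_helper,
)
print(result)
print(sorted(library))
```

In this run the first candidate is rejected because it calls `sum`, which is not in the function set. The stub then contributes a `square` helper, and the second candidate succeeds using only `total` and `square`. The library that remains after the call is the reusable by-product the abstract refers to. Checking only plain-name calls is a simplification; a stricter check would also need to handle attribute calls and imports.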