This work introduces (1) a technique that allows large language models (LLMs) to leverage user-provided code when solving programming tasks and (2) a method that iteratively generates modular sub-functions to aid future code generation attempts when the code initially generated by the LLM is inadequate. Generating computer programs in general-purpose programming languages like Python is challenging for LLMs when they are instructed to use code provided in the prompt. Code-specific LLMs (e.g., GitHub Copilot, CodeLlama2) can generate code completions in real time by drawing on all code available in a development environment. However, restricting code-specific LLMs to using only in-context code is not straightforward: the model is not explicitly instructed to use the user-provided code, and users cannot indicate precisely which snippets the model should incorporate into its context. Moreover, current systems lack effective recovery methods, forcing users to re-prompt the model with modified prompts until it produces an adequate solution. Our method differs from traditional LLM-powered code generation in two ways: it constrains code generation to an explicit function set, and it enables recovery from failed attempts through automatically generated sub-functions. When the LLM cannot produce working code, we generate modular sub-functions to aid subsequent attempts at generating functional code. A by-product of our method is a library of reusable sub-functions that can solve related tasks, imitating a software team whose efficiency scales with experience. We also introduce a new "half-shot" evaluation paradigm that provides tighter estimates of LLMs' coding abilities than traditional zero-shot evaluation. The proposed evaluation method encourages models to output solutions in a structured format, reducing syntax errors that can be mistaken for poor coding ability.
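To make the idea of constraining generation to an explicit function set more concrete, the following is a minimal sketch, not the method described in this work. It assumes one plausible post-hoc check: parse the generated code and report any directly called function that is not in the user-provided set, defined in the generated code itself, or a Python builtin. The names `called_names`, `disallowed_calls`, `user_functions`, `load_rows`, `normalize`, and `parse_csv` are all hypothetical and chosen only for illustration.

```python
import ast
import builtins


def called_names(source: str) -> set:
    """Return the names of functions called directly, as in f(...).

    Attribute calls such as obj.method(...) are ignored in this sketch.
    """
    tree = ast.parse(source)
    return {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }


def disallowed_calls(source: str, allowed: set) -> set:
    """Return called names that are not allowed, locally defined, or builtins."""
    tree = ast.parse(source)
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    return called_names(source) - allowed - defined - set(dir(builtins))


# Hypothetical user-provided function set (an assumption for this example).
user_functions = {"load_rows", "normalize"}

# Hypothetical model output that stays within the allowed set.
generated = """
def solve(path):
    rows = load_rows(path)
    return [normalize(r) for r in rows]
"""

# Hypothetical model output that calls a function outside the allowed set.
bad = """
def solve(path):
    return parse_csv(path)
"""

print(disallowed_calls(generated, user_functions))
print(disallowed_calls(bad, user_functions))
```

Under these assumptions, a non-empty result could serve as the signal that an attempt failed the constraint and that a recovery step, such as generating sub-functions, should be tried. A check of this kind only inspects call names; it does not verify that the generated code is correct.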