Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, when interacting with LLMs, users have no guarantee that the suggested code correctly satisfies the intent they provided. In fact, correctness itself is hard to define, since natural language can be ambiguous and lacks a formal semantics. In this paper, we propose the workflow of {\it interactive test-driven code generation}, which leverages lightweight user feedback to (a) formalize the user intent using generated tests that can be useful for debugging, and (b) produce an improved set of code suggestions by pruning and ranking the candidate code suggestions. We describe a language-agnostic abstract algorithm and a concrete implementation, TiCoder. We perform an automated evaluation of TiCoder on the \emph{MBPP} and \emph{HumanEval} code generation benchmarks. Our results with the OpenAI Codex LLM are promising: our best algorithm improves the \passk{1} code generation accuracy (in absolute percentages) by $22.49\%$ to $37.71\%$ for MBPP and by $24.79\%$ to $53.98\%$ for HumanEval, using 1 to 5 simulated user queries.
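The workflow described above can be illustrated with a minimal Python sketch. This is not TiCoder's implementation: the function names, the representation of tests as (input, expected output) pairs, the simulated user `user_approves`, and the ranking heuristic (agreement with tests the user was not asked about) are all assumptions made for illustration only.

```python
from typing import Callable, List, Tuple

Code = Callable[[int], int]   # a candidate code suggestion (hypothetical representation)
Test = Tuple[int, int]        # a generated test: (input, expected output)


def passes(code: Code, test: Test) -> bool:
    """Return True if the candidate yields the expected output without raising."""
    arg, expected = test
    try:
        return code(arg) == expected
    except Exception:
        return False


def interactive_prune_and_rank(
    codes: List[Code],
    tests: List[Test],
    user_approves: Callable[[Test], bool],
    max_queries: int,
) -> Tuple[List[Code], List[Test]]:
    """Query the user on up to max_queries tests, then prune and rank candidates."""
    approved: List[Test] = []
    remaining = list(codes)
    for test in tests[:max_queries]:
        if user_approves(test):
            # The test matches the intent: keep only candidates that pass it.
            approved.append(test)
            remaining = [c for c in remaining if passes(c, test)]
        else:
            # The test contradicts the intent: drop candidates that agree with it.
            remaining = [c for c in remaining if not passes(c, test)]
    # Assumed heuristic: rank survivors by how many unqueried tests they pass.
    unqueried = tests[max_queries:]
    ranked = sorted(remaining, key=lambda c: -sum(passes(c, t) for t in unqueried))
    return ranked, approved


# Hypothetical intent: "return the square of n".
def square(n: int) -> int:
    return n * n


def double(n: int) -> int:
    return n * 2


def square_plus_one(n: int) -> int:
    return n * n + 1


candidates = [double, square_plus_one, square]
generated_tests = [(2, 4), (3, 6), (3, 9)]
# Simulated user: approves a test exactly when it matches the true intent.
ranked, approved = interactive_prune_and_rank(
    candidates, generated_tests, lambda t: t[1] == t[0] ** 2, max_queries=2
)
print([c.__name__ for c in ranked], approved)
```

In this toy run, the approved test `(2, 4)` removes `square_plus_one`, and the rejected test `(3, 6)` removes `double`, so two queries leave only `square`. The approved tests are returned alongside the ranked candidates because they serve as a partial formalization of the intent.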