Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, because NL is informal, it does not lend itself easily to checking that the generated code correctly satisfies the user's intent. In this paper, we propose a novel interactive workflow, TiCoder, for guided intent clarification (i.e., partial formalization) through tests, to support the generation of more accurate code suggestions. Through a mixed-methods user study with 15 programmers, we present an empirical evaluation of the workflow's effectiveness in improving code generation accuracy. We find that participants using the proposed workflow are significantly more likely to correctly evaluate AI-generated code and report significantly less task-induced cognitive load. Furthermore, we test the potential of the workflow at scale with four different state-of-the-art LLMs on two Python datasets, using an idealized proxy for user feedback. We observe an average absolute improvement of 45.97% in pass@1 code generation accuracy across both datasets and all LLMs within 5 user interactions, in addition to the automatic generation of accompanying unit tests.
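The test-based clarification loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual implementation: candidate programs and tests would come from an LLM, and the "idealized user" here is simulated by a ground-truth oracle; all names (`ticoder_prune`, `user_approves`) are hypothetical.

```python
from typing import Callable, List, Tuple

# A test is an (input, expected_output) pair proposed alongside candidate code.
Test = Tuple[int, int]

def ticoder_prune(candidates: List[Callable[[int], int]],
                  tests: List[Test],
                  user_approves: Callable[[Test], bool],
                  max_interactions: int = 5) -> List[Callable[[int], int]]:
    """Prune code candidates by asking the user to accept/reject tests.

    Each interaction shows one generated test to the (possibly simulated)
    user; candidates inconsistent with the user's verdict are discarded.
    """
    for test in tests[:max_interactions]:
        inp, expected = test
        if user_approves(test):
            # Test matches intent: keep candidates that pass it.
            candidates = [c for c in candidates if c(inp) == expected]
        else:
            # Test contradicts intent: keep candidates that fail it.
            candidates = [c for c in candidates if c(inp) != expected]
        if len(candidates) <= 1:
            break  # Intent is (as far as we can tell) disambiguated.
    return candidates

# Toy example: the user's intent is "square the input".
cands = [lambda x: x * x, lambda x: x + x, lambda x: x ** 3]
tests = [(3, 9), (2, 4)]
# Idealized user proxy: approves a test iff it agrees with the true intent.
oracle = lambda t: (t[0] * t[0]) == t[1]
survivors = ticoder_prune(cands, tests, oracle)
```

In this toy run, approving the test `(3, 9)` eliminates both incorrect candidates in a single interaction, illustrating how test feedback acts as a partial formalization of intent.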