Large language models of code (Code-LLMs) have recently brought tremendous advances to code completion, a fundamental feature of programming assistance and code intelligence. However, most existing work ignores the possible presence of bugs in the code context used for generation, even though bugs are inevitable in software development. We therefore introduce and study the buggy-code completion problem, inspired by the realistic scenario of real-time code suggestion in which the code context contains potential bugs, that is, anti-patterns that can become bugs in the completed program. To study the task systematically, we introduce two datasets: one with synthetic bugs derived from semantics-altering operator changes (buggy-HumanEval) and one with realistic bugs derived from user submissions to coding problems (buggy-FixEval). We find that the presence of potential bugs significantly degrades the generation performance of high-performing Code-LLMs. For instance, the passing rates of CodeGen-2B-mono on the test cases of buggy-HumanEval drop by more than 50% given a single potential bug in the context. Finally, we investigate several post-hoc methods for mitigating the adverse effect of potential bugs and find that a large gap in post-mitigation performance remains.
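To make the notion of a potential bug concrete, the following is a minimal sketch of a semantics-altering operator change applied to a partial program, followed by a pass-rate check on a handful of test cases. It is an illustration under stated assumptions, not the procedure used to build buggy-HumanEval: the swap table `OPERATOR_SWAPS`, the helpers `inject_operator_bug` and `pass_rate`, the toy function `count_below`, and the hand-written completions are all hypothetical.

```python
import io
import tokenize

# Hypothetical swap table: each operator maps to one with different semantics.
OPERATOR_SWAPS = {
    "<": ">=", ">=": "<", ">": "<=", "<=": ">",
    "==": "!=", "!=": "==", "+": "-", "-": "+",
}


def inject_operator_bug(prefix: str) -> str:
    """Flip the first swappable operator in a partial program (illustrative only)."""
    lines = prefix.splitlines(keepends=True)
    for tok in tokenize.generate_tokens(io.StringIO(prefix).readline):
        if tok.type == tokenize.OP and tok.string in OPERATOR_SWAPS:
            row, col = tok.start  # row is 1-based, col is 0-based
            line = lines[row - 1]
            swapped = OPERATOR_SWAPS[tok.string]
            lines[row - 1] = line[:col] + swapped + line[col + len(tok.string):]
            return "".join(lines)
    return prefix  # nothing to flip


def pass_rate(program: str, func_name: str, tests) -> float:
    """Fraction of (args, expected) test cases that the completed program passes."""
    namespace = {}
    exec(program, namespace)  # acceptable for a toy example; never run untrusted code this way
    func = namespace[func_name]
    passed = sum(func(*args) == expected for args, expected in tests)
    return passed / len(tests)


# A partial program (the code context) for a toy task: count values below a limit.
clean_prefix = (
    "def count_below(values, limit):\n"
    "    total = 0\n"
    "    for v in values:\n"
    "        if v < limit:\n"
)
buggy_prefix = inject_operator_bug(clean_prefix)  # the condition becomes `v >= limit`

# Hand-written stand-ins for model completions (assumptions, not model output).
naive_completion = "            total += 1\n    return total\n"
adapted_completion = "            continue\n        total += 1\n    return total\n"

tests = [(([1, 5, 10], 6), 2), (([], 3), 0), (([7, 8], 1), 0)]

clean_rate = pass_rate(clean_prefix + naive_completion, "count_below", tests)
naive_rate = pass_rate(buggy_prefix + naive_completion, "count_below", tests)
adapted_rate = pass_rate(buggy_prefix + adapted_completion, "count_below", tests)

print(buggy_prefix)
print(clean_rate, naive_rate, adapted_rate)
```

In this toy case the flipped comparison is only a potential bug: a completion that continues as if the context were correct fails two of the three tests, whereas a completion that adapts to the flipped condition (here, by skipping values with `continue`) still passes all of them.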