Large language models of code (Code-LLMs) have recently brought tremendous advances to code completion, a fundamental feature of programming assistance and code intelligence. However, most existing work ignores the possible presence of bugs in the code context used for generation, even though bugs are inevitable in software development. We therefore introduce and study the buggy-code completion problem, inspired by the realistic scenario of real-time code suggestion in which the code context contains potential bugs, that is, anti-patterns that can become bugs in the completed program. To study the task systematically, we introduce two datasets: one with synthetic bugs derived from semantics-altering operator changes (buggy-HumanEval) and one with realistic bugs derived from user submissions to coding problems (buggy-FixEval). We find that the presence of potential bugs significantly degrades the generation performance of high-performing Code-LLMs. For instance, the passing rates of CODEGEN-2B-MONO on the test cases of buggy-HumanEval drop by more than 50% given a single potential bug in the context. Finally, we investigate several post-hoc methods for mitigating the adverse effect of potential bugs and find that a significant gap in post-mitigation performance remains.
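To make the notion of a synthetic potential bug concrete, the following is a minimal sketch of a semantics-altering operator change applied to a partial program. It is an illustration under stated assumptions, not the datasets' actual construction procedure: the swap table `OPERATOR_SWAPS`, the helper `inject_potential_bug`, and the choice to flip only the first matching operator are all hypothetical.

```python
import io
import tokenize

# Hypothetical swap table (an assumption for illustration): each operator
# maps to a counterpart that changes the meaning of the expression.
OPERATOR_SWAPS = {
    "+": "-", "-": "+",
    "==": "!=", "!=": "==",
    "<": ">=", ">=": "<",
    ">": "<=", "<=": ">",
}


def inject_potential_bug(prefix: str) -> str:
    """Flip the first swappable operator in a partial program.

    Returns the prefix unchanged if no operator from the table is found.
    """
    lines = prefix.splitlines(keepends=True)
    tokens = tokenize.generate_tokens(io.StringIO(prefix).readline)
    try:
        for tok in tokens:
            if tok.type == tokenize.OP and tok.string in OPERATOR_SWAPS:
                row, col = tok.start  # row is 1-based, col is 0-based
                line = lines[row - 1]
                end = col + len(tok.string)
                lines[row - 1] = line[:col] + OPERATOR_SWAPS[tok.string] + line[end:]
                return "".join(lines)
    except (tokenize.TokenError, IndentationError):
        # A partial program may not tokenize cleanly to the end; in that
        # case we simply stop looking and leave the prefix as it is.
        pass
    return prefix


# A clean code context that a completion model would be asked to continue.
clean_prefix = (
    "def has_close_elements(numbers, threshold):\n"
    "    for i, a in enumerate(numbers):\n"
    "        for j, b in enumerate(numbers):\n"
    "            if i != j:\n"
)

# The same context with one potential bug: `!=` has become `==`.
buggy_prefix = inject_potential_bug(clean_prefix)
print(buggy_prefix)
```

In the buggy-code completion setting, a model would receive `buggy_prefix` rather than `clean_prefix` and would still be expected to produce a completed program that passes the original test cases.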