Code completion, a highly valuable topic in software development, is seeing growing adoption thanks to recent advances in large language models (LLMs). Prominent LLM-based code completion systems, such as GitHub Copilot and GPT, are trained with deep learning on vast quantities of unstructured text and open-source code. As a cornerstone of daily programming tasks, code completion has substantially boosted professionals' efficiency in building real-world software systems. Despite this flourishing market, we find that code completion systems often output suspicious results, and to date no automated testing and enhancement framework for such systems is available. This research proposes CCTEST, a framework to test and repair code completion systems in black-box settings. CCTEST features a set of novel mutation strategies, namely program structure-correlated (PSC) mutations, to generate mutated code completion inputs. It then detects inconsistent outputs, which represent possibly erroneous cases, among all the completed code cases. Moreover, CCTEST repairs the code completion outputs by selecting, as the final output of the code completion system, the output that best reflects the "average" appearance of all output cases. We detected a total of 33,540 inputs (with a true positive rate of 86%) that can trigger erroneous cases from eight popular LLM-based code completion systems. With repairing, we show that the accuracy of code completion systems is notably increased by 40% and 67% with respect to BLEU score and Levenshtein edit similarity, respectively.
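The detection and repair steps described in the abstract can be illustrated with a minimal sketch. It assumes that the completions produced for an original prompt and its mutated variants are compared by normalized Levenshtein edit similarity, that a completion counts as inconsistent when its mean similarity to the others falls below a threshold, and that the "average" output is the completion with the highest mean similarity to all others. The function names, the threshold value of 0.5, and the choice of similarity metric are illustrative assumptions and are not taken from CCTEST itself.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def mean_similarity(outputs: List[str], i: int) -> float:
    """Mean similarity of outputs[i] to every other completion."""
    others = [edit_similarity(outputs[i], o)
              for j, o in enumerate(outputs) if j != i]
    return sum(others) / len(others) if others else 1.0


def find_inconsistent(outputs: List[str], threshold: float = 0.5) -> List[int]:
    """Indices of completions that diverge from the rest.

    The threshold is a hypothetical value chosen for this sketch.
    """
    return [i for i in range(len(outputs))
            if mean_similarity(outputs, i) < threshold]


def select_repaired(outputs: List[str]) -> str:
    """Pick the completion closest, on average, to all the others."""
    if not outputs:
        raise ValueError("no completions to choose from")
    best = max(range(len(outputs)), key=lambda i: mean_similarity(outputs, i))
    return outputs[best]


# Completions for one prompt and three hypothetical mutated variants of it.
completions = [
    "return a + b",
    "return a + b",
    "return b + a",
    "raise NotImplementedError()",
]
suspicious = find_inconsistent(completions)
repaired = select_repaired(completions)
```

In this toy run, the fourth completion is flagged as inconsistent and the first is selected as the repaired output. Selecting an existing output rather than synthesizing a new one keeps the approach black-box: only the completions are needed, not access to the model.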