COCO: Testing Code Generation Systems via Concretized Instructions

Code generation systems, which produce source code from natural language instructions, have been developed extensively in recent years. Despite these advances, they still suffer from robustness issues: even slightly different instructions can yield code with significantly different semantics. Robustness is critical for code generation systems because it affects software development, software quality, and trust in the generated code. Existing testing techniques for general text-to-text software can detect some robustness issues, but their effectiveness is limited because they ignore the characteristics of code generation systems. In this work, we propose COCO, a novel technique for testing the robustness of code generation systems. COCO exploits the usage scenario of code generation systems to make the original programming instruction more concrete by incorporating features known to be contained in the original code. A robust system should preserve code semantics for the concretized instruction, and COCO reports a robustness inconsistency when it does not. We evaluated COCO on eight advanced code generation systems, including commercial tools such as Copilot and ChatGPT, using two widely-used datasets. Our results demonstrate the effectiveness of COCO in testing the robustness of code generation systems: it outperforms two techniques adopted from general text-to-text software testing by 466.66% and 104.02%, respectively. Furthermore, fine-tuning on the concretized instructions generated by COCO can reduce robustness inconsistencies by 18.35% to 53.91%.
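
The concretize-and-compare idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the feature set, the instruction template, and the function names (`extract_features`, `concretize`, `behaves_same`, `find_inconsistencies`) are all hypothetical, and comparing behavior on a handful of sample inputs is only a stand-in assumption for whatever semantic check COCO actually uses. The `generate` argument is assumed to be any callable that maps an instruction string to Python source code.

```python
import ast


def extract_features(code: str) -> list[str]:
    """Collect a few coarse syntactic features present in the generated code.

    The feature set here is illustrative only (an assumption of this sketch).
    """
    features = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.For):
            features.add("use a for loop")
        elif isinstance(node, ast.While):
            features.add("use a while loop")
        elif isinstance(node, ast.If):
            features.add("use an if statement")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            features.add(f"call {node.func.id}()")
    return sorted(features)


def concretize(instruction: str, feature: str) -> str:
    """Append a feature the original code already has to the instruction."""
    return f"{instruction.rstrip('.')}, and {feature}."


def behaves_same(code_a: str, code_b: str, func_name: str, inputs) -> bool:
    """Approximate semantic equivalence by comparing outputs on sample inputs.

    Warning: exec() runs the generated code directly; sandbox it in real use.
    """
    def run(code, args):
        try:
            namespace = {}
            exec(code, namespace)
            return ("ok", namespace[func_name](*args))
        except Exception as exc:
            return ("error", type(exc).__name__)

    return all(run(code_a, args) == run(code_b, args) for args in inputs)


def find_inconsistencies(generate, instruction: str, func_name: str, inputs):
    """Return the concretized instructions whose code diverges from the original."""
    original_code = generate(instruction)
    failures = []
    for feature in extract_features(original_code):
        concrete = concretize(instruction, feature)
        if not behaves_same(original_code, generate(concrete), func_name, inputs):
            failures.append(concrete)
    return failures
```

Because each added feature is already present in the originally generated code, a robust generator should produce equivalent code for every concretized instruction, so any entry returned by `find_inconsistencies` flags a potential robustness inconsistency under the assumptions above.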
