Unit testing is essential for verifying the functional correctness of code modules (e.g., classes and methods), but writing unit tests manually is labor-intensive and time-consuming. Tests generated by tools based on traditional approaches, such as search-based software testing (SBST), often lack readability, naturalness, and practical usability. Large language models (LLMs) have recently shown promising results and become integral to developers' daily practice. As a result, software repositories now contain a mix of human-written tests, LLM-generated tests, and tests produced by traditional tools such as SBST. While the zero-shot capabilities of LLMs have been studied extensively, their few-shot potential for unit test generation remains underexplored. Few-shot prompting lets an LLM learn from examples included in the prompt, and automatically retrieving suitable examples could strengthen test suites. This paper empirically investigates how few-shot prompting with test examples from different sources (human-written, SBST-generated, or LLM-generated) affects the quality of LLM-generated unit tests as program comprehension artifacts and their contribution to improving existing test suites, evaluating not only correctness and coverage but also readability, cognitive complexity, and maintainability in hybrid human-AI codebases. We conducted experiments on the HumanEval and ClassEval datasets using GPT-4o, which is integrated into GitHub Copilot and widely used by developers, and we also assessed retrieval-based methods for selecting relevant examples. Our results show that LLMs can generate high-quality tests via few-shot prompting, with human-written examples yielding the best coverage and correctness. Moreover, selecting examples by the combined similarity of the problem description and the code consistently produces the most effective few-shot prompts.