Software testing is a crucial phase in the software life cycle, helping identify potential risks and reduce maintenance costs. With the advancement of Large Language Models (LLMs), researchers have proposed an increasing number of LLM-based software testing techniques, particularly for test case generation. Despite this growing interest, limited effort has been made to thoroughly evaluate the actual capabilities of LLMs on this task. In this paper, we introduce TestBench, a benchmark for class-level LLM-based test case generation. We construct a dataset of 108 Java programs drawn from 9 real-world, large-scale GitHub projects, each representing a different thematic domain. We then design three types of prompts that differ in the context they provide: self-contained context, full context, and simple context. In addition, we propose a fine-grained evaluation framework that considers five aspects of test cases: syntactic correctness, compilation correctness, test correctness, code coverage rate, and defect detection rate. Furthermore, we propose a heuristic algorithm to repair erroneous test cases generated by LLMs. We evaluate CodeLlama-13b, GPT-3.5, and GPT-4 on TestBench, and our experimental results indicate that larger models are better able to exploit contextual information and thus generate higher-quality test cases, whereas smaller models may struggle with the noise introduced by the extensive information in the full context. However, when given the simplified version, namely the simple context, which is derived from the full context via abstract syntax tree analysis, the performance of these smaller models improves significantly. Our analysis highlights the current progress and pinpoints future directions for enhancing model effectiveness through better handling of contextual information during test case generation.
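The "simple context" idea, keeping only the structural skeleton of the code under test while stripping implementation bodies via AST analysis, can be sketched as follows. This is an illustrative analogue written in Python on Python source (the benchmark itself targets Java via its own AST tooling), not the paper's implementation; the function name `simple_context` is chosen here for illustration:

```python
import ast


def simple_context(source: str) -> str:
    """Reduce a full source file to a 'simple context': class names and
    method signatures only, with all bodies stripped. Illustrative
    sketch of the AST-based reduction described in the abstract."""
    tree = ast.parse(source)
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # Keep the class declaration, drop its body.
            lines.append(f"class {node.name}:")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Keep only the signature; bodies become ellipses.
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"    def {node.name}({args}): ...")
    return "\n".join(lines)
```

A model prompted with this reduced skeleton sees the API surface it must exercise without the implementation noise that, per the abstract, degrades smaller models.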