Test suites tend to grow as software evolves, often making it infeasible to execute all test cases within the allocated testing budget, especially for large software systems. Test suite minimization (TSM) is employed to improve the efficiency of software testing by removing redundant test cases, thus reducing testing time and resources while maintaining the fault detection capability of the test suite. Most existing TSM approaches rely on code coverage (white-box) or model-based features, which are not always available to test engineers. Recent TSM approaches that rely only on test code (black-box), such as ATM and FAST-R, have been proposed. To address their scalability limitations, we propose LTM (Language model-based Test suite Minimization), a novel, scalable, black-box similarity-based TSM approach built on large language models (LLMs), which is the first application of LLMs in the context of TSM. To support similarity measurement over test code embeddings, we investigate five pre-trained language models: CodeBERT, GraphCodeBERT, UniXcoder, StarEncoder, and CodeLlama, on which we compute two similarity measures: Cosine Similarity and Euclidean Distance. Our goal is to find similarity measures that are not only computationally more efficient but can also better guide a Genetic Algorithm (GA) in searching for optimal minimized test suites, thus reducing the overall search time. Experimental results show that the best configuration of LTM (UniXcoder/Cosine) outperforms ATM in three aspects: (a) achieving a slightly greater saving rate of testing time (41.72% versus 41.02%, on average); (b) attaining a significantly higher fault detection rate (0.84 versus 0.81, on average); and, most importantly, (c) minimizing test suites nearly five times faster on average, with higher gains for larger test suites and systems, thus achieving much higher scalability.
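The two similarity measures named above can be illustrated with a minimal sketch. The toy vectors below stand in for test-code embeddings that a model such as UniXcoder would produce; the function names are illustrative and not part of the LTM implementation.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Straight-line distance between two embedding vectors (0.0 = identical)."""
    return float(np.linalg.norm(u - v))

# Toy stand-ins for embeddings of two test cases.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # parallel to a, so cosine similarity is 1.0

print(cosine_similarity(a, b))   # 1.0: identical direction despite different magnitude
print(euclidean_distance(a, b))  # sqrt(14): nonzero, as magnitudes differ
```

Note that the two measures can rank test-case pairs differently: cosine similarity ignores vector magnitude while Euclidean distance does not, which is one reason to compare both when guiding the search.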