LTM: Scalable and Black-box Similarity-based Test Suite Minimization based on Language Models

Test suites tend to grow as software evolves, often making it infeasible to execute all test cases within the allocated testing budget, especially for large software systems. Test suite minimization (TSM) is therefore employed to improve the efficiency of software testing by removing redundant test cases, reducing testing time and resources while maintaining the fault detection capability of the test suite. Most TSM approaches rely on code coverage (white-box) or model-based features, which are not always available to test engineers. Recent TSM approaches that rely only on test code (black-box), such as ATM and FAST-R, have been proposed. To address scalability, we propose LTM (Language model-based Test suite Minimization), a novel, scalable, black-box, similarity-based TSM approach based on large language models (LLMs). To support similarity measurement, we investigated three pre-trained language models (CodeBERT, GraphCodeBERT, and UniXcoder) for extracting embeddings of test code, on which we computed two similarity measures: cosine similarity and Euclidean distance. Our goal is to find similarity measures that are not only computationally more efficient but can also better guide a Genetic Algorithm (GA), thus reducing the overall search time. Experimental results, under a 50% minimization budget, showed that the best configuration of LTM (UniXcoder with cosine similarity) outperformed the two best configurations of ATM in three key respects: (a) achieving a greater saving rate of testing time (40.38% versus 38.06%, on average); (b) attaining a significantly higher fault detection rate (0.84 versus 0.81, on average); and, more importantly, (c) minimizing test suites much faster (26.73 minutes versus 72.75 minutes, on average), in terms of both preparation time (up to two orders of magnitude faster) and search time (one order of magnitude faster).
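To make the two similarity measures concrete, the following is a minimal sketch of how cosine similarity and Euclidean distance could be computed over test-code embeddings. It is not the authors' implementation: the toy three-dimensional vectors stand in for embeddings that a model such as UniXcoder would produce, and the helper `mean_pairwise_similarity` is a hypothetical illustration of a redundancy score that a search algorithm might try to minimize, not the fitness function used by LTM.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two embedding vectors (0.0 = identical)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_pairwise_similarity(embeddings, selected):
    """Hypothetical redundancy score: average cosine similarity over all pairs
    of selected test cases. Lower values indicate a more diverse subset."""
    pairs = list(combinations(selected, 2))
    if not pairs:
        return 0.0
    total = sum(cosine_similarity(embeddings[i], embeddings[j]) for i, j in pairs)
    return total / len(pairs)


# Toy stand-ins for test-code embeddings (assumed, not model output).
embeddings = [
    [1.0, 0.0, 0.0],  # test 0
    [1.0, 0.0, 0.0],  # test 1: identical to test 0, hence redundant
    [0.0, 1.0, 0.0],  # test 2
    [0.0, 0.0, 2.0],  # test 3
]

redundant_score = mean_pairwise_similarity(embeddings, [0, 1])
diverse_score = mean_pairwise_similarity(embeddings, [0, 2, 3])
```

In this toy setting, a subset containing both test 0 and test 1 scores higher (more redundant) than one that keeps only mutually dissimilar tests, which is the intuition behind similarity-based minimization under a fixed budget.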
