Test Suite Minimization (TSM) reduces the size of test suites while preserving their fault detection capability. In black-box TSM, reduction is performed without relying on production-code instrumentation. While several black-box TSM approaches have explored information sources such as test logs or test similarity, these often suffer from scalability and efficiency issues. Recently, change history has been explored as a lightweight and scalable indicator for guiding black-box TSM. However, existing approaches weight all historical modifications equally, ignoring the temporal dynamics of software evolution, in which recently modified code tends to be more fault-prone. To address this limitation, we introduce temporal modeling into black-box TSM and propose Temporal Risk-driven Test Suite Minimization (TRTM). TRTM extracts modification history from version-control metadata and applies exponential temporal attenuation to weight changes by recency, producing time-weighted class-level risk scores that reflect fault-proneness. Next, it determines dependencies between test cases and production classes by constructing static call graphs derived solely from test code, preserving the black-box setting. The risk scores of the classes exercised by each test case are then aggregated using statistical measures such as Average and Geometric Mean to compute a risk score for the test case. Finally, test cases with the highest risk scores are selected to construct the reduced suite. Evaluation on a large dataset containing 14 projects with 631 versions shows that TRTM consistently outperforms the state-of-the-art baseline, achieving a mean Accuracy of 0.72 (vs. 0.66) and a mean Fault Detection Rate (FDR) of 0.75 (vs. 0.69), while also reducing execution time.
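The pipeline summarized above (recency-weighted class risk, aggregation per test case, top-ranked selection) can be illustrated with a minimal sketch. It is not the TRTM implementation: the decay form `exp(-decay_rate * age_in_days)`, the `decay_rate` value, the function names, and the toy data are all assumptions made for illustration, and the test-to-class mapping is given by hand rather than derived from a static call graph.

```python
import math
from statistics import fmean, geometric_mean


def class_risk(change_ages_days, decay_rate=0.01):
    """Sum one exponentially attenuated weight per past change.

    Assumed form: a change made `age` days ago contributes
    exp(-decay_rate * age), so recent changes weigh more.
    """
    return sum(math.exp(-decay_rate * age) for age in change_ages_days)


def score_test(classes, risk_by_class, aggregate=fmean):
    """Aggregate the risk of the classes a test case exercises."""
    # Classes without recorded changes are skipped; this also keeps
    # geometric_mean away from zero-valued inputs.
    scores = [risk_by_class[c] for c in classes if risk_by_class.get(c, 0.0) > 0.0]
    return aggregate(scores) if scores else 0.0


def minimize(test_to_classes, risk_by_class, budget, aggregate=fmean):
    """Return the `budget` test cases with the highest risk scores."""
    ranked = sorted(
        test_to_classes,
        key=lambda t: score_test(test_to_classes[t], risk_by_class, aggregate),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical change history: ages (in days) of commits touching each class.
change_ages = {"Parser": [1, 400], "Cache": [300], "Util": [700]}
risk_by_class = {name: class_risk(ages) for name, ages in change_ages.items()}

# Hypothetical test-to-class dependencies (hand-written for this sketch).
test_to_classes = {
    "testParse": ["Parser"],
    "testCache": ["Cache", "Util"],
    "testAll": ["Parser", "Cache", "Util"],
}

reduced_by_average = minimize(test_to_classes, risk_by_class, budget=2)
reduced_by_geomean = minimize(
    test_to_classes, risk_by_class, budget=2, aggregate=geometric_mean
)
```

In this toy setting both aggregators keep the two test cases that reach the recently changed `Parser` class. On real data the choice of aggregator can change the ranking, since the geometric mean penalizes a test case more strongly for exercising low-risk classes.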