Mitigating ML Model Decay in Continuous Integration with Data Drift Detection: An Empirical Study

Background: Machine Learning (ML) methods are increasingly used to automate activities in Continuous Integration (CI), such as Test Case Prioritization (TCP). However, ML models need frequent retraining because the CI environment changes over time, a phenomenon commonly known as data drift. Continuously retraining ML models also consumes considerable time and effort. Hence, there is a pressing need to identify and evaluate approaches that reduce the retraining effort and time for ML models used for TCP in CI environments. Aims: This study investigates how well data drift detection techniques can automatically detect retraining points for ML models used for TCP in CI environments, without requiring detailed knowledge of the software projects. Method: We employed the Hellinger distance to identify changes in both the values and the distribution of the input data, and used these changes as retraining points for the ML model. We evaluated the method on multiple datasets and compared its APFDc and NAPFD scores against those of models that were retrained regularly, with careful attention to the statistical methods used in the comparison. Results: Our experimental evaluation showed that the Hellinger distance-based method is effective and efficient at detecting retraining points and reducing the associated costs. However, its performance may vary depending on the dataset. Conclusions: Our findings suggest that data drift detection methods can help identify retraining points for ML models in CI environments while significantly reducing the required retraining time. These methods can help practitioners who lack specialized knowledge of the software projects to maintain ML model accuracy.
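To make the Method concrete, the following is a minimal sketch of how a Hellinger distance check on one input feature could flag a retraining point. It is not the study's implementation: the function names, the equal-width binning, the bin count of 10, and the threshold of 0.3 are all illustrative assumptions. The Hellinger distance between two discrete distributions P and Q is H(P, Q) = (1/sqrt(2)) * sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2), which ranges from 0 (identical) to 1 (no overlap).

```python
import math


def histogram(values, bins, lo, hi):
    """Normalized histogram of `values` over `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    # Fall back to a width of 1.0 when all values are identical (hi == lo).
    width = (hi - lo) / bins or 1.0
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), bins - 1)  # clamp so v == hi lands in the last bin
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def needs_retraining(reference, current, bins=10, threshold=0.3):
    """Return True when `current` has drifted from `reference`.

    `bins` and `threshold` are hypothetical defaults chosen for illustration,
    not values taken from the study.
    """
    # Use a shared range so both histograms are defined over the same bins.
    lo = min(min(reference), min(current))
    hi = max(max(reference), max(current))
    p = histogram(reference, bins, lo, hi)
    q = histogram(current, bins, lo, hi)
    return hellinger(p, q) > threshold
```

In a CI setting, `reference` would hold a feature's values from the builds the model was last trained on and `current` the values from a recent window of builds; the model is retrained only when the check returns True, rather than on a fixed schedule.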
