Supervised machine learning often encounters concept drift, where the data distribution changes over time, degrading model performance. Existing drift detection methods focus on identifying these shifts but often overlook the challenge of acquiring labeled data for model retraining after a shift occurs. We present the Strategy for Drift Sampling (SUDS), a novel method that selects homogeneous samples for retraining using existing drift detection algorithms, thereby enhancing model adaptability to evolving data. SUDS seamlessly integrates with current drift detection techniques. We also introduce the Harmonized Annotated Data Accuracy Metric (HADAM), a metric that evaluates classifier performance in relation to the quantity of annotated data required to achieve the stated performance, thereby taking into account the difficulty of acquiring labeled data. Our contributions are twofold: SUDS combines drift detection with strategic sampling to improve the retraining process, and HADAM provides a metric that balances classifier performance with the amount of labeled data, ensuring efficient resource utilization. Empirical results demonstrate the efficacy of SUDS in optimizing labeled data use in dynamic environments, significantly improving the performance of machine learning applications in real-world scenarios. Our code is open source and available at https://github.com/cfellicious/SUDS/