datadriftR: An R Package for Concept Drift Detection in Predictive Models

Predictive models often degrade in performance as data distributions evolve, a phenomenon known as data drift. Among its forms, concept drift, in which the relationship between the explanatory variables and the response variable changes, is particularly challenging to detect and adapt to. Traditional drift detection methods often rely on metrics such as accuracy or variable distributions, which may fail to capture subtle but significant conceptual changes. This paper introduces drifter, an R package designed to detect concept drift, and proposes a novel method called Profile Drift Detection (PDD), which both detects drift and helps explain its cause by leveraging an explainable AI tool, Partial Dependence Profiles (PDPs). The PDD method, central to the package, quantifies changes in PDPs through novel metrics, keeping detection sensitive to shifts in the data stream without excessive computational cost. This approach aligns with MLOps practices, which emphasize model monitoring and adaptive retraining in dynamic environments. Experiments on synthetic and real-world datasets demonstrate that PDD outperforms existing methods, maintaining high accuracy while balancing sensitivity and stability. The results highlight its ability to retrain models adaptively in dynamic environments, making it a robust tool for real-time applications. The paper concludes by discussing the advantages, limitations, and future extensions of the package for broader use cases.
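The idea of comparing partial dependence profiles across data windows can be illustrated with a minimal sketch. It is written in plain Python rather than R, and it does not use the drifter API: every function name, the mean-absolute-difference distance, and the threshold of 1.0 are assumptions made for illustration, standing in for the paper's own PDD metrics, which the abstract does not specify.

```python
from statistics import mean


def fit_linear(xs, ys):
    """Ordinary least squares for one feature; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var = sum((x - x_bar) ** 2 for x in xs)
    slope = cov / var
    return y_bar - slope * x_bar, slope


def partial_dependence_profile(predict, rows, feature, grid):
    """Average prediction over `rows` with `feature` forced to each grid value."""
    return [
        mean(predict({**row, feature: value}) for row in rows)
        for value in grid
    ]


def profile_distance(profile_a, profile_b):
    """Illustrative metric (an assumption): mean absolute gap between profiles."""
    return mean(abs(a - b) for a, b in zip(profile_a, profile_b))


# Toy stream: the x -> y relationship changes between the two windows.
xs = [0, 1, 2, 3, 4]
old_ys = [2 * x for x in xs]    # old concept: y = 2x
new_ys = [3 - x for x in xs]    # new concept: y = 3 - x

a_old, b_old = fit_linear(xs, old_ys)
a_new, b_new = fit_linear(xs, new_ys)

rows = [{"x": x} for x in xs]
grid = [0, 1, 2, 3, 4]
pdp_old = partial_dependence_profile(lambda r: a_old + b_old * r["x"], rows, "x", grid)
pdp_new = partial_dependence_profile(lambda r: a_new + b_new * r["x"], rows, "x", grid)

distance = profile_distance(pdp_old, pdp_new)
THRESHOLD = 1.0                 # hypothetical cut-off, not taken from the paper
drift_detected = distance > THRESHOLD
print(f"profile distance = {distance:.2f}, drift detected = {drift_detected}")
```

Because the comparison is made profile by profile, the feature whose profile moved the most points to a likely cause of the drift, which is the explanatory benefit the abstract attributes to PDD.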

