Machine learning (ML) models are widely recognized for their strong forecasting performance. To maintain that performance in streaming-data settings, they have to be monitored and frequently re-trained. This can be done with machine learning operations (MLOps) techniques under the supervision of an MLOps engineer. However, in digital platform settings, where the number of data streams is typically large and unstable, standard monitoring becomes either suboptimal or too labor-intensive for the MLOps engineer. As a consequence, companies often fall back on very simple, worse-performing ML models without monitoring. We solve this problem by adopting a design science approach and introducing a new monitoring framework, the Machine Learning Monitoring Agent (MLMA), which is designed to work at scale for any ML model at reasonable labor cost. A key feature of our framework is test-based automated re-training based on a data-adaptive reference loss batch. The MLOps engineer is kept in the loop via key metrics and also acts, proactively or retrospectively, to maintain the performance of the ML model in the production stage. We conduct a large-scale test at a last-mile delivery platform to empirically validate our monitoring framework.
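To make the idea of test-based automated re-training concrete, the following is a minimal sketch, not the paper's exact MLMA procedure: it assumes re-training is triggered when a batch of recent production losses is statistically worse than a data-adaptive reference loss batch, here tested with a one-sided Mann-Whitney U test. The function name `should_retrain`, the significance level, and the synthetic loss distributions are hypothetical illustrations.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def should_retrain(recent_losses, reference_losses, alpha=0.01):
    """Trigger re-training if recent losses are statistically larger than the reference batch.

    Hypothetical illustration of test-based re-training; the actual MLMA test and
    reference-batch construction may differ.
    """
    # One-sided test: are recent losses stochastically larger than the reference losses?
    _, p_value = mannwhitneyu(recent_losses, reference_losses, alternative="greater")
    return p_value < alpha


# Synthetic example: a drifted stream produces larger losses than the reference batch.
rng = np.random.default_rng(0)
reference = rng.gamma(shape=2.0, scale=1.0, size=500)  # losses collected as the reference batch
recent = rng.gamma(shape=2.0, scale=1.4, size=200)     # recent production losses after drift
print(should_retrain(recent, reference))               # True -> schedule automated re-training
```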