Data stream forecasts are essential inputs for decision-making at digital platforms. Machine learning (ML) algorithms are appealing candidates to produce such forecasts. Yet digital platforms require a large-scale forecast framework that can respond flexibly to sudden performance drops. Re-training ML algorithms as fast as new data batches arrive is usually too costly computationally. Infrequent re-training, on the other hand, requires specifying the re-training frequency and typically comes at a severe cost in forecast deterioration. To ensure accurate and stable forecasts, we propose a simple data-driven monitoring procedure to answer the question of when the ML algorithm should be re-trained. Instead of investigating instability of the data streams, we test whether the incoming streaming forecast loss batch differs from a well-defined reference batch. Using a novel dataset consisting of 15-minute frequency data streams from an on-demand logistics platform operating in London, we apply the monitoring procedure to popular ML algorithms including random forest, XGBoost and lasso. We show that monitor-based re-training produces accurate forecasts compared to viable benchmarks while preserving computational feasibility. Moreover, the choice of monitoring procedure is more important than the choice of ML algorithm, which allows practitioners to combine the proposed monitoring procedure with their preferred forecasting algorithm.
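To make the idea of loss-based monitoring concrete, the following is a minimal sketch, not the procedure proposed in the paper. The abstract does not specify the test statistic, so the sketch assumes a one-sided Welch-type comparison of mean losses with a normal critical value, and it ignores serial correlation in the losses. The function names `loss_batch_differs` and `monitor`, the parameter `z_crit`, and the rule that resets the reference batch after a trigger are all hypothetical.

```python
import math
import statistics


def loss_batch_differs(reference_losses, incoming_losses, z_crit=1.645):
    """Assumed test: is the incoming mean loss significantly larger
    than the reference mean loss (one-sided, normal approximation)?"""
    n_ref, n_new = len(reference_losses), len(incoming_losses)
    mean_ref = statistics.fmean(reference_losses)
    mean_new = statistics.fmean(incoming_losses)
    # Sample variances; each batch needs at least two observations.
    var_ref = statistics.variance(reference_losses)
    var_new = statistics.variance(incoming_losses)
    std_err = math.sqrt(var_ref / n_ref + var_new / n_new)
    if std_err == 0.0:
        # Degenerate case: no variation in either batch.
        return mean_new > mean_ref
    z_stat = (mean_new - mean_ref) / std_err
    return z_stat > z_crit


def monitor(loss_batches, z_crit=1.645):
    """Return the indices of the batches that would trigger re-training."""
    reference = loss_batches[0]
    retrain_at = []
    for i, batch in enumerate(loss_batches[1:], start=1):
        if loss_batch_differs(reference, batch, z_crit):
            retrain_at.append(i)
            # Hypothetical reset rule: in practice the new reference would
            # come from the losses of the re-trained model; here the
            # triggering batch is used as a stand-in.
            reference = batch
    return retrain_at


if __name__ == "__main__":
    batches = [
        [1.0, 1.1, 0.9, 1.0],    # reference losses
        [1.0, 1.05, 0.95, 1.0],  # similar loss level
        [2.0, 2.1, 1.9, 2.0],    # sudden performance drop
        [2.0, 2.05, 1.95, 2.0],  # stable at the new level
    ]
    print(monitor(batches))
```

In this toy input, only the third batch shows a clear jump in loss relative to its reference, so it is the only one that triggers re-training; the model is otherwise left untouched, which is where the computational savings over re-training at every batch would come from.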