In this paper we address the problem of learning and backtesting inventory control policies in the presence of general arrival dynamics, which we term a quantity-over-time (QOT) arrivals model. We also allow order quantities to be modified in a post-processing step to meet vendor constraints such as minimum order and batch size requirements, a common practice in real supply chains. To the best of our knowledge, this is the first work to handle either arbitrary arrival dynamics or arbitrary downstream post-processing of order quantities. Building on recent work (Madeka et al., 2022), we likewise formulate the periodic review inventory control problem as an exogenous decision process, in which most of the state is outside the agent's control. Madeka et al. (2022) show how to construct a simulator that replays historical data to solve this class of problems. In our case, we incorporate a deep generative model of the arrivals process, Gen-QOT, into the history replay. Formulating the problem as an exogenous decision process lets us apply results from Madeka et al. (2022) to obtain a reduction to supervised learning. We show via simulation studies that this approach yields statistically significant improvements in profitability over production baselines. Finally, using data from an ongoing real-world A/B test, we show that Gen-QOT generalizes well to off-policy data.
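To make the post-processing step concrete, the following is a minimal sketch of one possible rule, not the method used in this work (which permits arbitrary post-processing). The function name `post_process_order` and the parameters `min_order` and `batch_size` are hypothetical, and the rounding rule (round up to a whole number of batches, then raise to the smallest batch multiple that meets the minimum) is an assumption chosen for illustration.

```python
import math


def post_process_order(quantity: float, min_order: int, batch_size: int) -> int:
    """Hypothetical vendor-constraint rule (an assumption, not the paper's).

    Rounds a raw policy order up to a whole number of batches and then
    enforces the vendor's minimum order quantity.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if quantity <= 0:
        # The policy chose not to order, so no constraint applies.
        return 0
    # Round up to the nearest multiple of the batch size.
    rounded = math.ceil(quantity / batch_size) * batch_size
    if rounded < min_order:
        # Raise to the smallest batch multiple that meets the minimum.
        rounded = math.ceil(min_order / batch_size) * batch_size
    return rounded
```

Under this rule, a raw order of 7 units with a minimum of 10 and a batch size of 6 becomes 12 units. Because the submitted order can differ from the quantity the policy chose, a backtest would need to apply the same rule when replaying history.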