In this paper, we propose a novel reinforcement learning algorithm for inventory management of newly launched products with no historical demand information. The algorithm follows the classic Dyna-$Q$ structure, balancing model-free and model-based approaches, while accelerating the training process of Dyna-$Q$ and mitigating the model discrepancy introduced by the model-based feedback. Based on the idea of transfer learning, warm-start information from the demand data of existing similar products can be incorporated into the algorithm to further stabilize early-stage training and reduce the variance of the estimated optimal policy. Our approach is validated through a case study of bakery inventory management with real data. The adjusted Dyna-$Q$ achieves up to a 23.7\% reduction in average daily cost compared with $Q$-learning, and up to a 77.5\% reduction in training time over the same horizon compared with classic Dyna-$Q$. With transfer learning, the adjusted Dyna-$Q$ attains the lowest total cost, the lowest variance in total cost, and a relatively low shortage percentage among all benchmarked algorithms over a 30-day test horizon.
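For readers unfamiliar with the classic Dyna-$Q$ structure the abstract refers to, the sketch below shows its standard tabular form: each real transition drives one model-free $Q$-learning update and also updates a learned one-step model, which then generates several simulated planning updates. This is a minimal illustration of the generic algorithm only, on a hypothetical toy environment; it does not reproduce the paper's adjusted variant or its warm-start transfer-learning mechanism.

```python
import random
from collections import defaultdict

def dyna_q(env_step, states, actions, episodes=200, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Classic tabular Dyna-Q: one direct (model-free) Q-learning update
    per real step, followed by `planning_steps` simulated updates drawn
    from a learned deterministic one-step model."""
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q[(state, action)] -> estimated value
    model = {}               # model[(state, action)] -> (reward, next_state)
    for _ in range(episodes):
        s = rng.choice(states)
        for _ in range(50):
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            r, s2 = env_step(s, a)
            # direct RL update from the real transition
            Q[(s, a)] += alpha * (
                r + gamma * max(Q[(s2, a_)] for a_ in actions) - Q[(s, a)])
            # model learning (deterministic, last-observed transition)
            model[(s, a)] = (r, s2)
            # planning: replay simulated transitions from the model
            for _ in range(planning_steps):
                sp, ap = rng.choice(list(model))
                rp, sp2 = model[(sp, ap)]
                Q[(sp, ap)] += alpha * (
                    rp + gamma * max(Q[(sp2, a_)] for a_ in actions) - Q[(sp, ap)])
            s = s2
    return Q
```

The planning loop is what distinguishes Dyna-$Q$ from plain $Q$-learning: simulated updates accelerate convergence, but any inaccuracy in `model` feeds back into `Q` — the model discrepancy that the paper's adjusted variant aims to mitigate.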