Reinforcement learning (RL) for inventory management (IM) is a nascent field of research, and existing approaches tend to be limited to simple, linear environments, with implementations that are minor modifications of off-the-shelf RL algorithms. Scaling these simplistic environments to a real-world supply chain poses several challenges: minimizing the computational requirements of the environment, specifying agent configurations that are representative of the dynamics at real-world stores and warehouses, and specifying a reward framework that encourages desirable behavior across the whole supply chain. In this work, we present a system with a custom GPU-parallelized environment consisting of one warehouse and multiple stores, a novel architecture for agent-environment dynamics that incorporates enhanced state and action spaces, and a shared reward specification that seeks to optimize for a large retailer's supply chain needs. Each vertex in the supply chain graph is an independent agent that, based on its own inventory, is able to place replenishment orders with the vertex upstream. The warehouse agent, aside from placing orders with the supplier, has the special property of also being able to constrain replenishment to the stores downstream, which results in it learning an additional allocation sub-policy. In the single-product setting, the system outperforms standard inventory control policies, such as a base-stock policy, as well as other RL-based specifications, and we lay out a future direction of work for multiple products.
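For readers unfamiliar with the base-stock baseline mentioned above, the following is a minimal sketch of one period in a one-warehouse, multi-store chain. It is not the RL system described in this work. The function names, the proportional rationing rule in `allocate`, the zero lead times, the lost-sales assumption, and the unlimited supplier are all illustrative assumptions chosen to keep the example short.

```python
from typing import List, Tuple


def base_stock_order(inventory_position: int, base_stock_level: int) -> int:
    """Order up to the base-stock level; never order a negative quantity."""
    return max(0, base_stock_level - inventory_position)


def allocate(requests: List[int], available: int) -> List[int]:
    """Hypothetical allocation rule (an assumption, not the learned sub-policy):
    fill every request if stock allows, otherwise ration proportionally,
    rounding each share down."""
    total = sum(requests)
    if total <= available:
        return list(requests)
    return [r * available // total for r in requests]


def step(
    store_inv: List[int],
    warehouse_inv: int,
    demand: List[int],
    store_level: int,
    warehouse_level: int,
) -> Tuple[List[int], int, List[int]]:
    """Simulate one period with zero lead times (a simplifying assumption)."""
    # Each store requests replenishment from the warehouse upstream.
    requests = [base_stock_order(inv, store_level) for inv in store_inv]
    # The warehouse constrains what is actually shipped downstream.
    shipped = allocate(requests, warehouse_inv)
    warehouse_inv -= sum(shipped)
    # Stores receive shipments and serve demand; unmet demand is lost.
    new_store_inv, sales = [], []
    for inv, received, d in zip(store_inv, shipped, demand):
        on_hand = inv + received
        sold = min(on_hand, d)
        sales.append(sold)
        new_store_inv.append(on_hand - sold)
    # The warehouse orders up to its own level from an unlimited supplier.
    warehouse_inv += base_stock_order(warehouse_inv, warehouse_level)
    return new_store_inv, warehouse_inv, sales
```

In this sketch, the warehouse holds less stock than the stores request whenever `sum(requests) > warehouse_inv`, which is exactly the situation where an allocation decision matters; an RL warehouse agent would replace the fixed rationing rule with a learned one.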