Inventory management offers unique opportunities for reliably evaluating and applying deep reinforcement learning (DRL). Rather than evaluating DRL algorithms only against one another or against human experts, we can compare them to the optimum itself in several problem classes with hidden structure. Our DRL methods consistently recover near-optimal policies in such settings, even when applied to raw state vectors of up to 600 dimensions. In other settings, they can vastly outperform problem-specific heuristics. To apply DRL reliably, we leverage two insights. First, one can directly optimize the hindsight performance of any policy using stochastic gradient descent. This relies on (i) the ability to backtest any policy's performance on a subsample of historical demand observations, and (ii) the differentiability of the total cost incurred on any subsample with respect to the policy parameters. Second, we propose a natural neural network architecture for problems with weak (or aggregate) coupling constraints between locations in an inventory network. This architecture employs weight duplication for "sibling" locations in the network, together with state summarization. We justify this architecture through an asymptotic guarantee, and empirically affirm its value in handling large-scale problems.
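The first insight can be illustrated with a deliberately small sketch. It is not the authors' implementation: it assumes a single location with zero lead time and backlogged demand, replaces the neural network with a one-parameter base-stock policy (order up to `S` every period), uses synthetic uniform demand in place of historical data, and writes the derivative by hand instead of using automatic differentiation. The names `hindsight_cost_and_grad`, `fit_base_stock`, and `average_cost`, and the cost parameters `H` and `P`, are hypothetical.

```python
import random

H, P = 1.0, 9.0  # assumed per-unit holding and backlog costs (illustrative)


def hindsight_cost_and_grad(S, demands):
    """Backtest "order up to S" on one demand trace.

    Returns the average per-period cost and its derivative with respect
    to S. The cost is piecewise linear, so the derivative exists
    everywhere except at the kinks S == d.
    """
    cost = grad = 0.0
    for d in demands:
        if S >= d:  # leftover stock: pay holding cost
            cost += H * (S - d)
            grad += H
        else:  # unmet demand: pay backlog cost
            cost += P * (d - S)
            grad -= P
    n = len(demands)
    return cost / n, grad / n


def fit_base_stock(traces, steps=2000, batch=8, lr=5.0, seed=0):
    """Stochastic gradient descent on the hindsight cost of sampled traces."""
    rng = random.Random(seed)
    S = 0.0
    for t in range(steps):
        sample = rng.sample(traces, batch)  # subsample of demand traces
        g = sum(hindsight_cost_and_grad(S, tr)[1] for tr in sample) / batch
        S -= lr / (1 + t) ** 0.5 * g  # decaying step size
    return S


def average_cost(S, traces):
    """Average hindsight cost of base-stock level S over all traces."""
    return sum(hindsight_cost_and_grad(S, tr)[0] for tr in traces) / len(traces)


# Synthetic stand-in for historical demand: 200 traces of 20 periods each.
data_rng = random.Random(1)
traces = [[data_rng.uniform(0.0, 100.0) for _ in range(20)] for _ in range(200)]

S_hat = fit_base_stock(traces)
# Brute-force benchmark: best base-stock level on a grid 0, 0.5, ..., 100.
best_grid_cost = min(average_cost(0.5 * k, traces) for k in range(201))
```

In this toy setting the hindsight-optimal level is the `P / (H + P)` quantile of the observed demands (the classical newsvendor solution), so the learned `S_hat` can be checked against a known optimum, which mirrors on a small scale the benchmarking idea described above.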