There has been growing interest in applying reinforcement learning (RL) to inventory management, either by optimizing over temporal transitions or by learning directly from full historical demand trajectories. This contrasts sharply with classical data-driven approaches, which first estimate demand distributions from past data and then compute well-structured optimal policies via dynamic programming. This paper considers a hybrid approach that combines trajectory-based RL with policy regularization imposing base-stock and $(s, S)$ structures. We provide generalization guarantees for this combined approach for several well-known policy classes in a $T$-period dynamic inventory model, using tools from the celebrated Vapnik-Chervonenkis (VC) theory, such as the pseudo-dimension and the fat-shattering dimension. Our results have implications for regret against the best-in-class policy, and allow for an arbitrary distribution over demand sequences, requiring no assumptions such as independence across time. Surprisingly, we prove that the class of policies defined by $T$ non-stationary base-stock levels exhibits a generalization error that does not grow with $T$, whereas the two-parameter $(s, S)$ policy class has a generalization error growing logarithmically with $T$. Overall, our analysis leverages specific inventory structures within the learning-theoretic framework, and improves sample-complexity guarantees even compared to existing results that assume independent demands.
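For concreteness, the two policy structures above can be stated as order rules; this is the standard textbook formulation written in our own notation (not notation taken from this paper), with $x_t$ denoting the inventory position at the start of period $t$ and $q_t$ the order quantity:
\[
q_t^{\mathrm{BS}}(x_t) = (S_t - x_t)^+, \qquad
q^{(s,S)}(x_t) =
\begin{cases}
S - x_t, & x_t \le s,\\
0, & x_t > s.
\end{cases}
\]
Thus the non-stationary base-stock class is parameterized by $T$ order-up-to levels $(S_1, \dots, S_T)$, one per period, while the $(s, S)$ class is parameterized by the two thresholds $s \le S$ alone.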