Offline Dynamic Inventory and Pricing Strategy: Addressing Censored and Dependent Demand

In this paper, we study the offline sequential feature-based pricing and inventory control problem in which current demand depends on past demand levels and any demand exceeding the available inventory is lost. Our goal is to leverage an offline dataset, consisting of past prices, ordering quantities, inventory levels, covariates, and censored sales levels, to estimate the optimal pricing and inventory control policy that maximizes long-term profit. While the underlying dynamics without censoring can be modeled by a Markov decision process (MDP), the primary obstacle arises from the observed process, where demand censoring is present and results in missing profit information, the failure of the Markov property, and a non-stationary optimal policy. To overcome these challenges, we first approximate the optimal policy by solving a high-order MDP characterized by the number of consecutive censoring instances, which ultimately boils down to solving specialized Bellman equations tailored to this problem. Inspired by offline reinforcement learning and survival analysis, we propose two novel data-driven algorithms for solving these Bellman equations and thereby estimating the optimal policy. Furthermore, we establish finite-sample regret bounds to validate the effectiveness of these algorithms. Finally, we conduct numerical experiments to demonstrate the efficacy of our algorithms in estimating the optimal policy. To the best of our knowledge, this is the first data-driven approach to learning optimal pricing and inventory control policies in a sequential decision-making environment characterized by censored and dependent demand. The implementations of the proposed algorithms are available at https://github.com/gundemkorel/Inventory_Pricing_Control
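
To make the censoring mechanism concrete, the following is a minimal illustrative sketch, not the paper's algorithm or the code in the linked repository. It simulates demand that depends on the previous period's demand, records only censored sales (the minimum of demand and inventory), and tracks the number of consecutive censored periods, the quantity that characterizes the high-order MDP described above. The function name, the AR(1)-style demand model, and all parameter values are assumptions chosen for illustration only.

```python
import random


def simulate_censored_sales(inventory_levels, demand_mean=10.0,
                            dependence=0.5, noise_sd=2.0, seed=0):
    """Simulate dependent demand observed only through censored sales.

    Hypothetical model (an assumption, not taken from the paper):
    demand_t = (1 - dependence) * demand_mean
               + dependence * demand_{t-1} + Gaussian noise,
    truncated at zero.
    """
    rng = random.Random(seed)
    prev_demand = demand_mean
    consecutive_censored = 0
    records = []
    for inventory in inventory_levels:
        # Current demand depends on the previous (possibly unobserved) demand.
        demand = max(0.0, (1 - dependence) * demand_mean
                     + dependence * prev_demand
                     + rng.gauss(0.0, noise_sd))
        # Demand above the available inventory is lost, so only the
        # minimum of demand and inventory is ever recorded.
        sales = min(demand, inventory)
        censored = demand > inventory
        # Count how many periods in a row have been censored.
        consecutive_censored = consecutive_censored + 1 if censored else 0
        records.append({
            "inventory": inventory,
            "sales": sales,
            "censored": censored,
            "consecutive_censored": consecutive_censored,
        })
        prev_demand = demand
    return records


if __name__ == "__main__":
    for row in simulate_censored_sales([5, 5, 1000, 5]):
        print(row)
```

In a censored period the true demand is hidden, so the observed sales alone no longer summarize the history that drives the next demand; the counter of consecutive censored periods records how far back the last fully observed demand lies.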
