A Simulation Environment and Reinforcement Learning Method for Waste Reduction

In retail (e.g., grocery stores, apparel shops, online retailers), inventory managers have to balance short-term risk (no items to sell) with long-term risk (over-ordering leading to product waste). This balancing task is especially hard because information about future customer purchases is lacking. In this paper, we study the problem of restocking a grocery store's inventory of perishable items over time, from a distributional point of view. The objective is to maximize sales while minimizing waste, under uncertainty about actual consumption by customers. This problem is highly relevant today, given the growing demand for food and the impact of food waste on the environment, the economy, and purchasing power. We frame inventory restocking as a new reinforcement learning task that exhibits stochastic behavior conditioned on the agent's actions, making the environment partially observable. We make two main contributions. First, we introduce a new reinforcement learning environment, RetaiL, based on real grocery store data and expert knowledge. This environment is highly stochastic and presents a unique challenge for reinforcement learning practitioners. We show that classical supply chain algorithms do not handle uncertainty about the future behavior of the environment well, and that distributional approaches are a good way to account for that uncertainty. Second, we introduce GTDQN, a distributional reinforcement learning algorithm that learns a generalized Tukey lambda distribution over the reward space. GTDQN provides a strong baseline for our environment. It outperforms other distributional reinforcement learning approaches in this partially observable setting, in both overall reward and reduction of generated waste.
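To make the idea of a generalized Tukey lambda distribution concrete, the following minimal sketch evaluates its quantile function and draws samples by inverse-transform sampling. It assumes the classical four-parameter Ramberg-Schmeiser form, Q(u) = lam1 + (u^lam3 - (1 - u)^lam4) / lam2; the parametrization actually used by GTDQN is not stated in this abstract and may differ. The function names `gld_quantile` and `sample_gld` and all parameter values are hypothetical, chosen only for illustration.

```python
import random


def gld_quantile(u, lam1, lam2, lam3, lam4):
    """Quantile function of a generalized lambda distribution.

    Assumed Ramberg-Schmeiser form (not necessarily the one used by GTDQN):
    lam1 is a location, lam2 an inverse scale, lam3 and lam4 shape the tails.
    """
    if not 0.0 < u < 1.0:
        raise ValueError("u must lie strictly between 0 and 1")
    return lam1 + (u ** lam3 - (1.0 - u) ** lam4) / lam2


def sample_gld(n, lam1, lam2, lam3, lam4, seed=0):
    """Draw n samples by pushing uniform draws through the quantile function."""
    rng = random.Random(seed)
    eps = 1e-9  # keep draws strictly inside (0, 1)
    return [
        gld_quantile(rng.uniform(eps, 1.0 - eps), lam1, lam2, lam3, lam4)
        for _ in range(n)
    ]


if __name__ == "__main__":
    # Illustrative parameters only; with lam3 == lam4 the median equals lam1.
    params = dict(lam1=2.0, lam2=1.0, lam3=0.3, lam4=0.3)
    print("median:", gld_quantile(0.5, **params))
    print("samples:", sample_gld(5, **params))
```

Because the distribution is defined through its quantile function, a learner can in principle output the four parameters and read off any quantile of the return directly, which is one plausible reason such a family suits a distributional approach.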
