Optimizing the consolidation process in container-based fulfillment centers requires trading off competing objectives such as processing speed, resource usage, and space utilization while adhering to a range of real-world operational constraints. This process involves moving items between containers via a combination of human and robotic workstations to free up space for inbound inventory and increase container utilization. We formulate this problem as a large-scale Multi-Objective Reinforcement Learning (MORL) task with high-dimensional state spaces and dynamic system behavior. Our method builds on recent theoretical advances in solving constrained RL problems via best-response and no-regret dynamics in zero-sum games, enabling principled minimax policy learning. Policy evaluation on realistic warehouse simulations shows that our approach effectively trades off objectives, and we empirically observe that it learns a single policy that simultaneously satisfies all constraints, even though this is not theoretically guaranteed. We further introduce a theoretical framework that addresses the problem of error cancellation, in which time-averaged solutions exhibit oscillatory behavior; the framework returns a single iterate whose Lagrangian value is close to the minimax value of the game. These results demonstrate the promise of MORL in solving complex, high-impact decision-making problems in large-scale industrial systems.
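To make the game-theoretic formulation concrete, the sketch below works through a toy constrained bandit: the dual variable follows no-regret (projected gradient) updates while the policy player plays an exact best response to the current Lagrangian, so the time-averaged iterates satisfy the constraint even though individual iterates oscillate. The environment, step sizes, and the iterate-selection heuristic at the end are illustrative assumptions, not the actual warehouse environment or the paper's algorithm.

```python
import numpy as np

# Toy constrained bandit: maximize expected reward subject to expected cost <= budget.
# (Illustrative stand-in for the constrained-RL objective; all numbers are made up.)
rewards = np.array([1.0, 0.8, 0.3])
costs   = np.array([0.9, 0.5, 0.1])
budget  = 0.4

def lagrangian(p, lam):
    # L(p, lam) = -reward(p) + lam * (cost(p) - budget);
    # the policy player minimizes L, the dual player maximizes it.
    return -p @ rewards + lam * (p @ costs - budget)

lam, eta, lam_max = 0.0, 0.05, 10.0
T = 2000
avg_policy = np.zeros_like(rewards)
lagr_values, iterates = [], []

for t in range(T):
    # Policy player: exact best response to the current multiplier
    # (in the full problem this step would be an RL policy-optimization subroutine).
    scores = -rewards + lam * costs
    p = np.zeros_like(rewards)
    p[np.argmin(scores)] = 1.0

    # Dual player: no-regret (projected gradient ascent) update on the multiplier.
    lam = np.clip(lam + eta * (p @ costs - budget), 0.0, lam_max)

    avg_policy += p
    lagr_values.append(lagrangian(p, lam))
    iterates.append(p)

avg_policy /= T

# Time-averaged iterates approach a constrained-optimal mixed policy,
# while the individual best responses keep jumping between pure actions.
print("time-averaged policy:", avg_policy.round(3))
print("avg expected cost   :", float(avg_policy @ costs))

# One crude way to pick a single iterate: the one whose Lagrangian value is
# closest to the running average (a proxy for the game value). This only
# illustrates the idea of returning a near-minimax single iterate; it is not
# the selection rule from the paper.
game_value_proxy = float(np.mean(lagr_values))
best_t = int(np.argmin([abs(v - game_value_proxy) for v in lagr_values]))
print("selected iterate    :", iterates[best_t], "at step", best_t)
```

Running this, the best responses oscillate between the medium- and low-cost actions as the multiplier crosses its equilibrium value, while the time-averaged policy mixes them so that the expected cost lands near the budget, which is exactly the averaging-versus-single-iterate tension the abstract refers to.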