Adjustable Robust Reinforcement Learning for Online 3D Bin Packing

Designing effective policies for the online 3D bin packing problem (3D-BPP) has been a long-standing challenge, primarily because incoming box sequences are unpredictable and the physical constraints are stringent. Current deep reinforcement learning (DRL) methods for online 3D-BPP have shown promising results in optimizing average performance over an underlying box sequence distribution, but they often fail in real-world settings where worst-case scenarios can materialize. Standard robust DRL algorithms, in turn, tend to prioritize worst-case performance at the expense of performance under the nominal problem instance distribution. To address these issues, we first introduce a permutation-based attacker to investigate the practical robustness of both DRL-based and heuristic methods proposed for solving online 3D-BPP. We then propose an adjustable robust reinforcement learning (AR2L) framework that allows efficient adjustment of robustness weights to achieve the desired balance between the policy's performance in average and worst-case environments. Specifically, we formulate the objective function as a weighted sum of expected and worst-case returns, and derive a lower bound on this objective by relating it to the return under a mixture dynamics. To realize this lower bound, we adopt an iterative procedure that searches for the associated mixture dynamics and improves the corresponding policy. We integrate this procedure into two popular robust adversarial algorithms to develop the exact and approximate AR2L algorithms. Experiments demonstrate that AR2L is versatile in the sense that it improves policy robustness while maintaining an acceptable level of performance in the nominal case.
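
As a rough illustration of the weighted-sum objective described in the abstract, the following minimal sketch estimates it from sampled episode returns. The function name, the parameter `alpha`, and the specific form (nominal average plus `alpha` times worst-case average) are assumptions made for illustration; the abstract states only that the objective is a weighted sum of expected and worst-case returns. The input numbers are placeholders, not experimental results.

```python
from statistics import mean


def adjustable_robust_objective(nominal_returns, worst_case_returns, alpha):
    """Estimate a weighted sum of nominal and worst-case returns.

    nominal_returns:    episode returns collected under the nominal
                        box-sequence distribution (hypothetical input).
    worst_case_returns: episode returns collected under an adversarial
                        box-sequence distribution (hypothetical input).
    alpha:              assumed non-negative robustness weight; 0 recovers
                        the plain average-case objective.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    return mean(nominal_returns) + alpha * mean(worst_case_returns)


# Placeholder returns chosen only to show how the weight shifts the objective.
nominal = [1.0, 3.0]
worst = [0.0, 2.0]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, adjustable_robust_objective(nominal, worst, alpha))
```

With `alpha = 0` the estimate depends only on nominal returns; larger values of `alpha` give the worst-case returns more influence, which is the adjustable trade-off the framework is built around.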
