We present a proximal policy optimization (PPO) agent trained with curriculum learning (CL) and careful reward engineering to optimize a real-world, high-throughput waste sorting facility. Our work addresses the challenge of balancing the competing objectives of operational safety, volume optimization, and minimal resource usage. A vanilla agent trained from scratch on these multiple criteria fails to solve the problem because of its inherent complexity: the environment features extremely delayed rewards over long time horizons and class (or action) imbalance, with important actions occurring only infrequently in the optimal policy. The agent must therefore anticipate the long-term consequences of its actions and prioritize rare but rewarding behaviours, making this a non-trivial reinforcement learning task. Our five-stage CL approach tackles these challenges by gradually increasing the complexity of the environment dynamics during policy transfer while simultaneously refining the reward mechanism. This iterative and adaptable process enables the agent to learn the desired optimal policy. Results demonstrate that our approach significantly improves inference-time safety, achieving near-zero safety violations, in addition to enhancing the sorting plant's efficiency.
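The staged curriculum described above can be sketched as a simple schedule in which each stage both widens the environment dynamics and retunes the reward. This is a minimal illustrative sketch, not the authors' actual configuration: the stage names, complexity fractions, and safety weights are all assumptions, and the PPO policy itself is abstracted behind a `train_stage` callback.

```python
# Hedged sketch of a five-stage curriculum with policy transfer between stages.
# Stage parameters (names, complexity levels, reward weights) are illustrative
# assumptions, not values from the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    dynamics_complexity: float  # fraction of full environment dynamics enabled
    safety_weight: float        # reward-engineering knob refined per stage

CURRICULUM = [
    Stage("throughput-only", 0.2, 0.0),
    Stage("delayed-rewards", 0.4, 0.5),
    Stage("rare-actions",    0.6, 1.0),
    Stage("full-dynamics",   0.8, 2.0),
    Stage("safety-critical", 1.0, 4.0),
]

def train_with_curriculum(train_stage):
    """Run the curriculum, warm-starting each stage from the previous policy.

    `train_stage(policy, stage)` stands in for one round of PPO training
    on an environment configured by `stage`; it returns the updated policy.
    """
    policy = None  # no pretrained policy before stage one
    for stage in CURRICULUM:
        policy = train_stage(policy, stage)  # policy transfer between stages
    return policy
```

The key design point mirrored here is that complexity and the safety term of the reward increase monotonically, so the agent first masters throughput before the rare, safety-critical actions dominate the objective.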