The Job Shop Scheduling Problem (JSSP) is a complex combinatorial optimization problem. While online Reinforcement Learning (RL) has shown promise by quickly finding acceptable solutions for the JSSP, it faces key limitations: it requires extensive training interactions from scratch, leading to sample inefficiency; it cannot leverage existing high-quality solutions; and it often yields results inferior to traditional methods such as Constraint Programming (CP). We introduce Offline Reinforcement Learning for Learning to Dispatch (Offline-LD), which addresses these limitations by learning from previously generated solutions. Our approach is motivated by scenarios where historical scheduling data and expert solutions are available, although our current evaluation focuses on benchmark problems. Offline-LD adapts two CQL-based Q-learning methods (mQRDQN and discrete mSAC) to maskable action spaces, introduces a novel entropy-bonus modification for discrete SAC, and applies reward normalization during preprocessing. Our experiments demonstrate that Offline-LD outperforms online RL on both generated and benchmark instances. Notably, introducing noise into the expert dataset yields results similar to or better than those obtained from the clean expert dataset, suggesting that a more diverse training set is preferable because it contains counterfactual information.
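The abstract mentions adapting CQL-based Q-learning to maskable action spaces, where only a subset of dispatching actions is valid in each state. A minimal sketch of what a masked CQL penalty could look like follows, assuming the standard log-sum-exp form of the CQL regularizer restricted to valid actions; the function name, array shapes, and `alpha` weight are illustrative, not the paper's actual implementation.

```python
import numpy as np

def masked_cql_penalty(q_values, action_mask, data_actions, alpha=1.0):
    """Conservative Q-learning penalty restricted to valid actions.

    q_values:     (batch, n_actions) Q-value estimates
    action_mask:  (batch, n_actions) boolean, True = action is valid
    data_actions: (batch,) actions taken in the offline dataset
    """
    # Invalid actions get -inf so they vanish from the log-sum-exp.
    masked_q = np.where(action_mask, q_values, -np.inf)
    # Numerically stable log-sum-exp over valid actions only.
    m = masked_q.max(axis=1, keepdims=True)
    lse = m.squeeze(1) + np.log(np.exp(masked_q - m).sum(axis=1))
    # Push down Q-values of out-of-distribution actions relative
    # to the actions actually observed in the offline data.
    data_q = q_values[np.arange(len(data_actions)), data_actions]
    return alpha * (lse - data_q).mean()

# Example: the third action is masked out, so it does not inflate
# the conservatism penalty even though it has the highest Q-value.
q = np.array([[1.0, 2.0, 3.0]])
mask = np.array([[True, True, False]])
penalty = masked_cql_penalty(q, mask, np.array([1]))
```

Masking matters here for the same reason it matters in the policy: without it, the regularizer would penalize high Q-values on actions the agent could never take, distorting the learned value function.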