Learned construction heuristics for scheduling problems have become increasingly competitive with established solvers and heuristics in recent years. In particular, significant improvements have been observed in solution approaches based on deep reinforcement learning (DRL). While much attention has been paid to the design of network architectures and training algorithms to achieve state-of-the-art results, little research has investigated the optimal use of trained DRL agents during inference. Our work is based on the hypothesis that, similar to search algorithms, the utilization of trained DRL agents should depend on the acceptable computational budget. We propose a simple yet effective parameterization, called $\delta$-sampling, which manipulates the trained action vector to bias agent behavior towards exploration or exploitation during solution construction. This approach achieves a more comprehensive coverage of the search space while still generating an acceptable number of solutions. In addition, we propose an algorithm for obtaining the optimal parameterization for a given number of solutions and any given trained agent. Experiments extending existing training protocols for job shop scheduling problems with our inference method validate our hypothesis and yield the expected improvements in the quality of the generated solutions.
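The abstract does not define the exact form of the $\delta$-sampling manipulation. The sketch below is a hedged illustration only: the function name `delta_sample` and the linear interpolation between a greedy one-hot vector and the trained policy distribution are our assumptions, not the paper's formula. It shows one plausible way a single parameter $\delta \in [0, 1]$ could bias action selection between exploitation ($\delta \to 0$, greedy argmax) and exploration ($\delta \to 1$, sampling from the unmodified policy) during solution construction.

```python
import numpy as np

def delta_sample(action_probs: np.ndarray, delta: float,
                 rng: np.random.Generator) -> int:
    """Select an action from a trained policy's probability vector.

    delta = 0 reduces to greedy argmax (pure exploitation); delta = 1
    samples from the unmodified policy (maximal exploration). The linear
    blend below is an illustrative assumption, not the paper's definition.
    """
    greedy = np.zeros_like(action_probs)
    greedy[np.argmax(action_probs)] = 1.0
    # Interpolate between the greedy one-hot vector and the trained distribution.
    blended = (1.0 - delta) * greedy + delta * action_probs
    blended /= blended.sum()  # renormalize against rounding error
    return int(rng.choice(len(blended), p=blended))

# Example: spread a solution budget over a range of delta values, covering
# more of the search space than repeated greedy rollouts would.
rng = np.random.default_rng(0)
probs = np.array([0.6, 0.25, 0.1, 0.05])  # toy trained action vector
for delta in np.linspace(0.0, 1.0, 5):
    actions = [delta_sample(probs, delta, rng) for _ in range(3)]
    print(f"delta={delta:.2f} -> sampled actions {actions}")
```

Under this reading, the proposed algorithm for choosing the optimal parameterization would amount to allocating a given solution budget across $\delta$ values for a given trained agent; the abstract leaves the allocation scheme unspecified.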