Domain-Independent Dynamic Programming (DIDP) is a state-space search paradigm for combinatorial optimization based on dynamic programming (DP). In its current implementation, DIDP guides the search using user-defined dual bounds. Reinforcement learning (RL) is increasingly being applied to combinatorial optimization and shares key structures with DP: both are characterized by the Bellman equation and state-based transition systems. We propose using RL to obtain a heuristic function that guides the search in DIDP, and we develop two guidance approaches: value-based guidance using Deep Q-Networks and policy-based guidance using Proximal Policy Optimization. Our experiments indicate that RL-based guidance significantly outperforms standard DIDP and problem-specific greedy heuristics given the same number of node expansions. Furthermore, despite longer node evaluation times, RL-based guidance achieves better run-time performance than standard DIDP on three of four benchmark domains.
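For intuition, the following is a minimal sketch, not the paper's implementation, of how value-based guidance can be used: a learned state-value estimate stands in for the user-defined dual bound when prioritizing nodes in a best-first search. The `value_net`, `successors`, and `is_goal` interfaces below are hypothetical placeholders for a trained RL value function and the DP model.

```python
import heapq
import itertools


def guided_search(initial_state, successors, is_goal, value_net):
    """Best-first search where a learned value estimate guides expansion.

    `successors(state)` yields (next_state, transition_cost) pairs and
    `value_net(state)` returns an estimated cost-to-go; both are
    hypothetical placeholders for the DP model and the learned heuristic.
    States are assumed to be hashable.
    """
    counter = itertools.count()  # tie-breaker so heapq never compares states
    # Priority = cost so far + learned estimate of remaining cost.
    open_list = [(value_net(initial_state), next(counter), 0.0, initial_state)]
    best_g = {initial_state: 0.0}

    while open_list:
        _, _, g, state = heapq.heappop(open_list)
        if is_goal(state):
            return g  # cost of the solution found first
        for next_state, cost in successors(state):
            new_g = g + cost
            if new_g < best_g.get(next_state, float("inf")):
                best_g[next_state] = new_g
                priority = new_g + value_net(next_state)
                heapq.heappush(
                    open_list, (priority, next(counter), new_g, next_state)
                )
    return None  # search space exhausted without reaching a goal
```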