Since the 1990s, considerable empirical work has been carried out to train statistical models, such as neural networks (NNs), as learned heuristics for combinatorial optimization (CO) problems. When successful, such an approach eliminates the need for experts to design heuristics for each problem type. Due to their structure, many hard CO problems are amenable to treatment through reinforcement learning (RL). Indeed, there is a wealth of literature on training NNs with value-based, policy gradient, or actor-critic approaches, with promising results in terms of both empirical optimality gaps and inference runtimes. Nevertheless, there has been a paucity of theoretical work undergirding the use of RL for CO problems. To this end, we introduce a unified framework that models CO problems as Markov decision processes (MDPs) and solves them with RL techniques. We provide easy-to-test assumptions under which a CO problem can be formulated as an equivalent undiscounted MDP whose optimal policies yield optimal solutions to the original problem. Moreover, we establish conditions under which value-based RL techniques converge to approximate solutions of the CO problem with a guarantee on the associated optimality gap. Our convergence analysis provides: (1) a sufficient rate of increase in the batch size and the number of projected gradient descent steps at each RL iteration; (2) the resulting optimality gap in terms of problem parameters and the targeted RL accuracy; and (3) the importance of the choice of state-space embedding. Together, these results illuminate the success (and limitations) of the celebrated deep Q-learning algorithm in this problem context.
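To make the MDP reformulation concrete, the following is a minimal, illustrative sketch, not the paper's construction or code: a tiny max-cut instance cast as an undiscounted MDP in which states are partial vertex assignments, actions assign the next vertex to one of two sides, and the terminal reward equals the cut value. The graph, the `step`/`cut_value` helpers, and all hyperparameters are our own assumptions for illustration; tabular Q-learning stands in for the value-based methods analyzed in the paper.

```python
# Illustrative sketch: a toy max-cut instance as an undiscounted MDP,
# solved with tabular Q-learning (gamma = 1, terminal reward = cut value).
import random
from collections import defaultdict

EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # toy graph, illustrative only
N = 4  # number of vertices

def cut_value(assignment):
    """Number of edges whose endpoints lie on different sides of the cut."""
    return sum(assignment[u] != assignment[v] for u, v in EDGES)

def step(state, action):
    """Extend the partial assignment; the reward is paid only at the terminal state."""
    next_state = state + (action,)
    done = len(next_state) == N
    reward = cut_value(next_state) if done else 0.0  # undiscounted terminal reward
    return next_state, reward, done

Q = defaultdict(float)  # tabular Q-values keyed by (state, action)
alpha, eps = 0.1, 0.2   # learning and exploration rates (illustrative choices)

for episode in range(5000):
    state = ()
    while True:
        # epsilon-greedy over the two actions (side 0 or side 1)
        if random.random() < eps:
            action = random.randint(0, 1)
        else:
            action = max((0, 1), key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning target with gamma = 1 (undiscounted MDP)
        target = reward if done else max(Q[(next_state, a)] for a in (0, 1))
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        if done:
            break
        state = next_state

# Greedy rollout of the learned policy recovers a (near-)optimal cut.
state = ()
while len(state) < N:
    state += (max((0, 1), key=lambda a: Q[(state, a)]),)
print(state, cut_value(state))
```

The sketch reflects the abstract's central design choice: because the objective is collected only at the terminal state and no discounting is applied, an optimal policy of the MDP corresponds exactly to an optimal solution of the CO instance. The paper's analysis concerns the function-approximation analogue of this loop (NN-parameterized Q-values trained by projected gradient descent on sampled batches), where the batch size and step schedule govern the resulting optimality gap.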