Combinatorial Optimization underpins many real-world applications and yet, designing performant algorithms to solve these complex, typically NP-hard, problems remains a significant research challenge. Reinforcement Learning (RL) provides a versatile framework for designing heuristics across a broad spectrum of problem domains. However, despite notable progress, RL has not yet supplanted industrial solvers as the go-to solution. Current approaches emphasize pre-training heuristics that construct solutions but often rely on search procedures with limited variance, such as stochastically sampling numerous solutions from a single policy or employing computationally expensive fine-tuning of the policy on individual problem instances. Building on the intuition that performant search at inference time should be anticipated during pre-training, we propose COMPASS, a novel RL approach that parameterizes a distribution of diverse and specialized policies conditioned on a continuous latent space. We evaluate COMPASS across three canonical problems - Travelling Salesman, Capacitated Vehicle Routing, and Job-Shop Scheduling - and demonstrate that our search strategy (i) outperforms state-of-the-art approaches on 11 standard benchmarking tasks and (ii) generalizes better, surpassing all other approaches on a set of 18 procedurally transformed instance distributions.
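The core idea of conditioning a distribution of specialized policies on a continuous latent space can be illustrated with a toy sketch. Everything below is illustrative and not the paper's actual architecture: the "policy" is a random linear map over made-up dimensions, standing in for the learned conditioned network. It shows only the search pattern the abstract contrasts: sampling distinct latents to obtain distinct policies, rather than sampling many rollouts from a single policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear "policy" over 5 actions, conditioned on a 4-dim latent z.
# W maps [state ; z] -> action logits. In a real system this mapping is a
# pre-trained network; the random W here is purely for illustration.
STATE_DIM, LATENT_DIM, N_ACTIONS = 8, 4, 5
W = rng.normal(size=(N_ACTIONS, STATE_DIM + LATENT_DIM))

def policy_logits(state, z):
    """Action logits for one state under the policy selected by latent z."""
    return W @ np.concatenate([state, z])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Inference-time search: instead of stochastically sampling many solutions
# from ONE policy, sample distinct latents -- each z indexes a different
# specialized policy, giving a more diverse search distribution.
state = rng.normal(size=STATE_DIM)
latents = rng.normal(size=(16, LATENT_DIM))
action_dists = [softmax(policy_logits(state, z)) for z in latents]
```

Each entry of `action_dists` is a valid action distribution, and different latents induce genuinely different distributions, which is the source of search diversity the abstract refers to.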