Neural Combinatorial Optimization (NCO) has emerged as a promising approach to NP-hard problems. However, prevailing RL-based methods suffer from low sample efficiency due to sparse rewards and underused solutions. We propose Preference Optimization for Combinatorial Optimization (POCO), a training paradigm that leverages solution preferences via objective values. POCO introduces: (1) an efficient preference-pair construction scheme that better explores and exploits sampled solutions, and (2) a novel loss function that adaptively scales gradients by objective differences, removing reliance on reward models or reference policies. Experiments on Job-Shop Scheduling (JSP), the Traveling Salesman Problem (TSP), and Flexible Job-Shop Scheduling (FJSP) show that POCO outperforms state-of-the-art neural methods, substantially reducing optimality gaps while keeping inference efficient. POCO is architecture-agnostic, enabling seamless integration with existing NCO models, and establishes preference optimization as a principled framework for combinatorial optimization.
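The abstract describes the loss only at a high level. As a minimal sketch (in PyTorch) of what a reference-free preference loss with objective-scaled gradients might look like: the function name, the Bradley-Terry form, and the linear gap scaling below are assumptions for illustration, not POCO's actual formulation.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_better, logp_worse, obj_better, obj_worse, beta=1.0):
    """Hypothetical Bradley-Terry-style preference loss whose logit is scaled
    by the objective-value gap between the two solutions (lower objective =
    better). Pairs with larger quality gaps receive proportionally larger
    gradients; no reward model or reference policy appears anywhere."""
    gap = (obj_worse - obj_better).clamp(min=0.0)      # nonnegative quality gap
    logits = beta * gap * (logp_better - logp_worse)   # gap-scaled preference logit
    return -F.logsigmoid(logits).mean()

# Toy usage: sample two solutions per instance from the policy, order each
# pair by objective value, and optimize the policy's log-probabilities.
logp_a = torch.tensor([-3.2, -5.1], requires_grad=True)
logp_b = torch.tensor([-4.0, -4.8], requires_grad=True)
obj_a = torch.tensor([10.0, 12.0])   # e.g. makespans or tour lengths
obj_b = torch.tensor([11.5, 12.5])
better = obj_a <= obj_b
loss = preference_loss(
    torch.where(better, logp_a, logp_b),
    torch.where(better, logp_b, logp_a),
    torch.minimum(obj_a, obj_b),
    torch.maximum(obj_a, obj_b),
)
loss.backward()
```

Because the loss is defined directly on sampled solution pairs and their objective values, every rollout contributes a learning signal, which is one way the sample-efficiency claim above could be realized.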