Neural combinatorial optimization (NCO) trains autoregressive policies to solve routing problems. The standard training algorithm, REINFORCE with a rollout baseline, requires maintaining and periodically updating a frozen copy of the policy for variance reduction. This baseline introduces a structural vulnerability: on harder instances, a poor baseline produces noisy gradient estimates that can destabilize training. We evaluate Group Relative Policy Optimization (GRPO), an algorithm from large language model alignment that eliminates the separate baseline policy entirely by normalizing advantages within groups of sampled trajectories. In a controlled comparison of five RL algorithms on TSP and CVRP benchmarks within the RL4CO framework, we find that: (i) GRPO avoids the training collapse observed with REINFORCE on TSP-100, where the tour cost rises from 9.8 to 52.1 immediately after the warmup phase and does not recover under extended training; (ii) at matched gradient updates, GRPO achieves solution quality within 2% of POMO, a strong multi-start method built on the Attention Model (AM), while requiring no external baseline; and (iii) P3O, a pairwise preference algorithm also from the alignment literature, is competitive on TSP but shows higher variability on CVRP. These results identify GRPO as a promising baseline-free alternative for NCO, particularly in settings where baseline-dependent training becomes fragile.
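To make the group normalization concrete, the following is a minimal sketch of how group-relative advantages might be computed from sampled tour costs. It assumes the common GRPO formulation (reward minus the group mean, divided by the group standard deviation); the function name `group_relative_advantages`, the `eps` stabilizer, and the example costs are illustrative assumptions and are not taken from the RL4CO implementation evaluated here.

```python
import numpy as np


def group_relative_advantages(costs, eps=1e-8):
    """Hypothetical helper: normalize rewards within each group of samples.

    costs: array of shape (num_instances, group_size) holding the tour cost
           of each trajectory sampled for the same problem instance.
    Returns an array of the same shape with per-group standardized advantages.
    """
    costs = np.asarray(costs, dtype=np.float64)
    rewards = -costs  # lower cost means higher reward
    group_mean = rewards.mean(axis=1, keepdims=True)
    group_std = rewards.std(axis=1, keepdims=True)
    # eps is an assumed stabilizer for groups whose samples all have equal cost
    return (rewards - group_mean) / (group_std + eps)


# Made-up costs: two instances, four sampled tours each
costs = np.array([[10.0, 12.0, 11.0, 9.0],
                  [52.0, 50.0, 51.0, 53.0]])
adv = group_relative_advantages(costs)
```

Each row is centered on its own group mean, so the cheapest tour of an instance receives the largest advantage regardless of the absolute cost scale, and no frozen policy copy is needed to supply a reference value.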