A First Guess is Rarely the Final Answer: Learning to Search in the Traveling Salesperson Problem

Most neural solvers for the Traveling Salesperson Problem (TSP) are trained to output a single solution, even though practitioners rarely stop there: at test time, they routinely spend extra compute on sampling or post-hoc search. This raises a natural question: can the search procedure itself be learned? Neural improvement methods take this perspective by learning a policy that applies local modifications to a candidate solution, accumulating gains over an improvement trajectory. Yet learned improvement for TSP remains comparatively immature, and existing methods still fall short of robust, scalable performance. We argue that a key reason is a design mismatch: many approaches reuse state representations, architectural choices, and training recipes inherited from single-solution methods, rather than being built around the mechanics of local search. This mismatch motivates NICO-TSP (Neural Improvement for Combinatorial Optimization), a 2-opt improvement framework for TSP. NICO-TSP represents the current tour with exactly $n$ edge tokens aligned with the neighborhood operator, scores 2-opt moves directly without tour positional encodings, and trains via a two-stage procedure: imitation learning from short-horizon optimal trajectories, followed by critic-free group-based reinforcement learning over longer rollouts. Under compute-matched evaluations that measure improvement as a function of both search steps and wall-clock time, NICO-TSP delivers consistently stronger and markedly more step-efficient improvement than prior learned and heuristic search baselines, generalizes far more reliably to larger out-of-distribution instances, and serves both as a competitive replacement for classical local search and as a powerful test-time refinement module for constructive solvers.
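For readers unfamiliar with the neighborhood operator mentioned above, the following is a minimal sketch of a classical (non-learned) 2-opt step. It shows how a closed tour over $n$ cities has exactly $n$ edges and how a 2-opt move is identified by a pair of those edges. This is not the NICO-TSP policy: the function names (`tour_edges`, `two_opt_delta`, `apply_two_opt`, `best_improvement_step`) and the greedy best-improvement selection rule are illustrative assumptions, standing in for the learned move scoring the abstract describes.

```python
import math

def tour_edges(tour):
    """Return the n edges (tour[i], tour[i+1]) of a closed tour of n cities."""
    n = len(tour)
    return [(tour[i], tour[(i + 1) % n]) for i in range(n)]

def tour_length(tour, pts):
    """Euclidean length of the closed tour."""
    return sum(math.dist(pts[a], pts[b]) for a, b in tour_edges(tour))

def two_opt_delta(tour, pts, i, j):
    """Length change from removing edges i and j (i < j) and reconnecting."""
    n = len(tour)
    a, b = tour[i], tour[(i + 1) % n]
    c, d = tour[j], tour[(j + 1) % n]
    added = math.dist(pts[a], pts[c]) + math.dist(pts[b], pts[d])
    removed = math.dist(pts[a], pts[b]) + math.dist(pts[c], pts[d])
    return added - removed

def apply_two_opt(tour, i, j):
    """Apply the move by reversing the segment between edge i and edge j."""
    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]

def best_improvement_step(tour, pts):
    """Hypothetical greedy baseline: take the single best improving 2-opt move.

    Returns (new_tour, delta); delta is 0.0 when no move improves the tour.
    """
    n = len(tour)
    best_delta, best_move = 0.0, None
    for i in range(n - 1):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # these two edges share a city, so the move is a no-op
            delta = two_opt_delta(tour, pts, i, j)
            if delta < best_delta - 1e-12:
                best_delta, best_move = delta, (i, j)
    if best_move is None:
        return tour, 0.0
    return apply_two_opt(tour, *best_move), best_delta
```

In a learned improvement method, the exhaustive scan in `best_improvement_step` would be replaced by a policy that scores candidate edge pairs; the move itself (`apply_two_opt`) stays the same.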
