The constructive approach within Neural Combinatorial Optimization (NCO) treats a combinatorial optimization problem as a finite Markov decision process, where solutions are built incrementally through a sequence of decisions guided by a neural policy network. To train the policy, recent research is shifting toward a 'self-improved' learning methodology that addresses the limitations of reinforcement learning and supervised approaches. Here, the policy is iteratively trained in a supervised manner, with solutions derived from the current policy serving as pseudo-labels. The way these solutions are obtained from the policy determines the quality of the pseudo-labels. In this paper, we present a simple and problem-independent sequence decoding method for self-improved learning based on sampling sequences without replacement. We incrementally follow the best solution found and repeat the sampling process from intermediate partial solutions. By modifying the policy to ignore previously sampled sequences, we force it to consider only unseen alternatives, thereby increasing solution diversity. Experimental results on the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem demonstrate the method's strong performance. Furthermore, our method outperforms previous NCO approaches on the Job Shop Scheduling Problem.
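As a rough illustration of the decoding idea, the sketch below samples sequences without replacement by tracking, in a prefix tree, how much probability mass under each partial solution is still unsampled. It then follows the best solution found one step at a time and resamples from the resulting partial solution. The names `toy_policy`, `toy_cost`, `Node`, `sample_unseen`, `decode`, and the parameters `k` and `EPS` are hypothetical and introduced only for this example. The prefix-tree bookkeeping is one common way to realize sampling without replacement and is an assumption here, not necessarily the procedure used in this work.

```python
import random

EPS = 1e-12  # assumed threshold below which a subtree counts as exhausted


def toy_policy(prefix, n_items):
    """Hypothetical stand-in for a neural policy: a distribution over unused items."""
    remaining = [i for i in range(n_items) if i not in prefix]
    weights = [1.0 / (1 + i) for i in remaining]
    total = sum(weights)
    return {i: w / total for i, w in zip(remaining, weights)}


def toy_cost(seq):
    """Hypothetical objective: total jump distance between consecutive items."""
    return sum(abs(a - b) for a, b in zip(seq, seq[1:]))


class Node:
    def __init__(self, mass):
        self.mass = mass      # probability mass of not-yet-sampled completions
        self.children = None  # action -> Node, filled on first visit


def sample_unseen(node, prefix, n_items, rng):
    """Sample one complete sequence extending `prefix` that was not drawn before."""
    seq, path = list(prefix), [node]
    while len(seq) < n_items:
        if node.children is None:
            # First visit: nothing below was sampled yet, so mass splits by the policy.
            probs = toy_policy(seq, n_items)
            node.children = {a: Node(node.mass * p) for a, p in probs.items()}
        live = [(a, ch) for a, ch in node.children.items() if ch.mass > EPS]
        actions = [a for a, _ in live]
        weights = [ch.mass for _, ch in live]
        action = rng.choices(actions, weights=weights)[0]
        node = node.children[action]
        seq.append(action)
        path.append(node)
    # Remove the sampled sequence's mass so that it cannot be drawn again.
    p = node.mass
    for visited in path:
        visited.mass = max(visited.mass - p, 0.0)
    return seq


def decode(n_items, cost, k=4, seed=0):
    """Sample k unseen sequences per step, then follow the best one by one action."""
    rng = random.Random(seed)
    node, prefix = Node(1.0), []
    best, best_cost = None, float("inf")
    while len(prefix) < n_items:
        for _ in range(k):
            if node.mass <= EPS:  # every completion of this prefix was already seen
                break
            seq = sample_unseen(node, prefix, n_items, rng)
            c = cost(seq)
            if c < best_cost:
                best, best_cost = seq, c
        action = best[len(prefix)]  # best always extends the current prefix
        prefix.append(action)
        node = node.children[action]
    return best, best_cost


if __name__ == "__main__":
    print(decode(6, toy_cost, k=4, seed=0))
```

In a self-improved training loop, the sequence returned by `decode` would play the role of the pseudo-label for the instance. A real policy network would replace `toy_policy`, and a problem-specific objective would replace `toy_cost`.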