Current methods for end-to-end constructive neural combinatorial optimization typically train a policy either by behavior cloning from expert solutions or by policy-gradient reinforcement learning. While behavior cloning is straightforward, it requires costly expert solutions, and policy-gradient methods are often computationally demanding and difficult to tune. In this work, we bridge the two and simplify training: in each epoch, we sample multiple solutions for random instances with the current model and select the best solution as the expert trajectory for supervised imitation learning. To obtain progressively better solutions with minimal sampling, we introduce a method that combines round-wise Stochastic Beam Search with an update strategy derived from a provable policy improvement. This strategy refines the policy between rounds using the advantages of the sampled sequences, at almost no computational overhead. We evaluate our approach on the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem. Models trained with our method match the performance and generalization of models trained on expert data. Additionally, we apply our method to the Job Shop Scheduling Problem using a transformer-based architecture and outperform existing state-of-the-art methods by a wide margin.
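The core training loop described above — sample several solutions per instance with the current policy, keep the best one as a pseudo-expert trajectory, and imitate it — can be illustrated with a deliberately minimal sketch. The sketch below uses a toy TSP, a tabular transition-preference "policy" in place of a neural network, and a crude reinforcement of the best tour's transitions in place of a real supervised gradient step; all names (`sample_tour`, `train`, the logits table) are illustrative assumptions, not the paper's implementation, and it omits the Stochastic Beam Search and advantage-based update entirely.

```python
import math
import random


def tour_length(tour, coords):
    # total length of the closed tour through the given city coordinates
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))


def sample_tour(logits, n, rng):
    # autoregressively sample a tour: at each step, choose the next
    # unvisited city with probability proportional to exp(logit)
    tour = [0]
    while len(tour) < n:
        cur = tour[-1]
        cand = [j for j in range(n) if j not in tour]
        weights = [math.exp(logits[cur][j]) for j in cand]
        tour.append(rng.choices(cand, weights=weights)[0])
    return tour


def train(coords, epochs=20, samples=16, lr=0.5, seed=0):
    """Self-improvement loop: sample, select the best, imitate it."""
    rng = random.Random(seed)
    n = len(coords)
    logits = [[0.0] * n for _ in range(n)]  # stand-in for a neural policy
    best = None
    for _ in range(epochs):
        # 1) sample multiple solutions with the current policy
        tours = [sample_tour(logits, n, rng) for _ in range(samples)]
        # 2) select the best sampled solution as the expert trajectory
        best = min(tours, key=lambda t: tour_length(t, coords))
        # 3) "supervised" step: raise the preference for each transition
        #    on the best tour (crude proxy for a cross-entropy update)
        for i in range(n):
            logits[best[i]][best[(i + 1) % n]] += lr
    return logits, best
```

A real implementation would replace the logits table with an autoregressive model and step 3 with a cross-entropy loss on the selected trajectory, but the sample-select-imitate structure is the same.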