The traveling purchaser problem (TPP) is an important combinatorial optimization problem with broad applications. Due to the coupling between routing and purchasing, existing works on TPPs commonly address route construction and purchase planning simultaneously, which leads to exact methods with high computational cost and to heuristics with sophisticated designs but limited performance. In sharp contrast, we propose a novel approach based on deep reinforcement learning (DRL), which addresses route construction and purchase planning separately, while evaluating and optimizing the solution from a global perspective. The key components of our approach are a bipartite graph representation for TPPs that captures the market-product relations, and a policy network that extracts information from the bipartite graph and uses it to sequentially construct the route. One significant benefit of our framework is that the route can be constructed efficiently by the policy network, and once the route is determined, the associated purchasing plan can be easily derived through linear programming; meanwhile, leveraging DRL, we can train the policy network to optimize the global solution objective. Furthermore, by introducing a meta-learning strategy, the policy network can be trained stably on large TPP instances, and generalizes well across instances of varying sizes and distributions, even to much larger instances never seen during training. Experiments on various synthetic TPP instances and the TPPLIB benchmark demonstrate that our DRL-based approach significantly outperforms well-established TPP heuristics, reducing the optimality gap by 40%-90%, while also showing a runtime advantage, especially on large instances.
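To illustrate the decomposition described above, here is a minimal sketch of the purchase-planning step for a fixed route, written as a linear program with `scipy.optimize.linprog`. The instance data (`prices`, `supply`, `demand`) and the variable layout are toy assumptions for illustration, not the paper's actual formulation: purchase quantities `x[m, k]` must meet each product's demand at minimum cost, subject to per-market availability.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy instance: 2 markets on the fixed route, 3 products.
prices = np.array([[4.0, 2.0, 5.0],
                   [3.0, 6.0, 1.0]])          # prices[m, k]: unit price of product k at market m
supply = np.array([[5.0, 10.0, 2.0],
                   [8.0,  3.0, 7.0]])         # supply[m, k]: quantity available
demand = np.array([6.0, 4.0, 5.0])            # demand[k]: required quantity of product k

n_m, n_k = prices.shape
c = prices.ravel()                            # objective: total purchase cost, x flattened row-major

# Demand constraints, written as -sum_m x[m, k] <= -demand[k]  (i.e. sum_m x[m, k] >= demand[k])
A_demand = np.zeros((n_k, n_m * n_k))
for k in range(n_k):
    A_demand[k, k::n_k] = -1.0                # picks out x[0, k], x[1, k], ...
b_demand = -demand

# Availability enters as simple variable bounds: 0 <= x[m, k] <= supply[m, k]
bounds = [(0.0, s) for s in supply.ravel()]

res = linprog(c, A_ub=A_demand, b_ub=b_demand, bounds=bounds, method="highs")
# res.fun is the minimum purchase cost for this route; res.x the purchase plan.
```

Because the LP solves quickly and exactly once the route is fixed, its optimal cost can be folded into the route's total objective (travel cost plus purchase cost) and used as the reward signal when training the routing policy.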