We introduce the Pointer Q-Network (PQN), a hybrid neural architecture that integrates model-free Q-value policy approximation with Pointer Networks (Ptr-Nets) to improve the optimality of attention-based sequence generation with respect to long-term outcomes. This integration proves particularly effective for combinatorial optimization (CO) tasks, especially the Travelling Salesman Problem (TSP), which is the focus of our study. We address this challenge by defining a Markov Decision Process (MDP) compatible with PQN, involving iterative graph embedding followed by encoding and decoding with an LSTM-based recurrent neural network. This process generates a context vector and raw attention scores, which are dynamically adjusted by Q-values computed for all available state-action pairs before the softmax is applied. The resulting attention vector serves as an action distribution, with actions selected according to PQN's adaptive exploration-exploitation dynamics. Our empirical results demonstrate the efficacy of this approach, including evaluation of the model in unstable environments.
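The core mechanism described above (raw attention scores shifted by Q-values before the softmax, yielding an action distribution) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the scores, Q-values, and the additive combination with weight `beta` are all illustrative assumptions.

```python
import numpy as np

def pqn_action_distribution(attn_scores, q_values, beta=1.0):
    # Hypothetical combination: raw attention scores are shifted by the
    # Q-values of the corresponding state-action pairs before softmax.
    adjusted = attn_scores + beta * q_values
    exp = np.exp(adjusted - adjusted.max())  # numerically stable softmax
    return exp / exp.sum()

# Toy example: four unvisited TSP nodes as candidate actions.
scores = np.array([0.2, 1.1, -0.3, 0.5])   # raw attention scores
qvals = np.array([0.1, -0.5, 1.0, 0.0])    # Q-values for each action
dist = pqn_action_distribution(scores, qvals)
# `dist` is a valid probability distribution over the candidate actions;
# sampling from it (exploration) or taking its argmax (exploitation)
# selects the next node in the tour.
```

Note how the Q-value adjustment can reorder preferences relative to the raw attention scores alone, which is what lets the model favor actions with better long-term returns.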