Q-learning is a widely used reinforcement learning technique for solving path planning problems. It centers on the interaction between an agent and its environment, enabling the agent to learn an optimal policy that maximizes cumulative reward. Although many studies have reported the effectiveness of Q-learning, it still converges slowly in practical applications. To address this issue, we propose the NDR-QL method, which uses neural network outputs as heuristic information to accelerate the convergence of Q-learning. Specifically, we improve a dual-output neural network model by introducing a start-end channel separation mechanism and enhancing the feature fusion process. After training, the proposed NDR model outputs a narrowly focused optimal probability distribution, referred to as the guideline, and a broadly distributed suboptimal distribution, referred to as the region. Based on the guideline prediction, we compute a continuous reward function for the Q-learning method; based on the region prediction, we initialize the Q-table with a bias. We conducted training, validation, and path planning simulation experiments on public datasets. The results indicate that the NDR model outperforms previous methods by up to 5\% in prediction accuracy. Furthermore, the proposed NDR-QL method improves the convergence speed of the baseline Q-learning method by 90\% and also surpasses previously improved Q-learning methods in path quality metrics.
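The two uses of the NDR outputs described above can be sketched in a minimal tabular Q-learning setting. This is an illustrative assumption, not the paper's implementation: the grid size, the random stand-ins for the guideline and region maps, and the exact reward form are all hypothetical; the paper derives these maps from its trained neural network.

```python
import numpy as np

# Hypothetical 5x5 grid world. These random maps stand in for the NDR
# model's two predicted distributions (in the paper they come from a
# trained neural network, not from random numbers).
H, W = 5, 5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
goal = (4, 4)

rng = np.random.default_rng(0)
guideline = rng.random((H, W))  # narrow "optimal" distribution (assumed)
region = rng.random((H, W))     # broad "suboptimal" distribution (assumed)

# Biased Q-table initialization from the region prediction: each
# state-action value starts at the region probability of the successor
# state, so early exploration is drawn toward the predicted corridor.
Q = np.zeros((H, W, len(ACTIONS)))
for r in range(H):
    for c in range(W):
        for a, (dr, dc) in enumerate(ACTIONS):
            nr = min(max(r + dr, 0), H - 1)
            nc = min(max(c + dc, 0), W - 1)
            Q[r, c, a] = region[nr, nc]

def reward(state):
    """Continuous reward shaped by the guideline prediction.

    Sketch only: a constant step cost offset by the guideline
    probability of the state reached (exact form assumed)."""
    if state == goal:
        return 10.0
    return guideline[state] - 1.0

# Standard tabular Q-learning update rule.
alpha, gamma = 0.1, 0.95

def update(s, a, s_next):
    r = reward(s_next)
    Q[s][a] += alpha * (r + gamma * Q[s_next].max() - Q[s][a])
```

The design point is that the region map only shifts the starting values of the Q-table, so it biases early exploration without changing the fixed point, while the guideline map densifies the reward signal along the predicted optimal corridor.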