Learning Online Belief Prediction for Efficient POMDP Planning in Autonomous Driving

Effective decision-making in autonomous driving relies on accurate inference of other traffic agents' future behaviors. To achieve this, we propose a behavior prediction model based on online belief updates and an efficient planner for Partially Observable Markov Decision Processes (POMDPs). We develop a Transformer-based prediction model, enhanced with a recurrent neural memory model, to dynamically update the latent belief state and infer the intentions of other agents. The model can also integrate the ego vehicle's intentions to reflect closed-loop interactions among agents, and it learns from both offline data and online interactions. For planning, we employ a Monte-Carlo Tree Search (MCTS) planner with macro actions, which reduces computational complexity by searching over temporally extended action steps. Inside the MCTS planner, we use predicted long-term multi-modal trajectories to approximate future updates, which eliminates iterative belief updating and improves runtime efficiency. Our approach also incorporates deep Q-learning (DQN) as a search prior, which significantly improves the performance of the MCTS planner. Experimental results from simulated environments validate the effectiveness of the proposed method. The online belief update model significantly enhances the accuracy and temporal consistency of predictions, leading to improved decision-making performance. Using DQN as a search prior in the MCTS planner considerably improves the planner's performance and outperforms an imitation-learning-based prior. Additionally, we show that MCTS planning with macro actions substantially outperforms the vanilla planner in both performance and efficiency.
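To make the planning idea concrete, the following is a minimal sketch of how a learned Q-value prior could bias action selection in MCTS over macro actions. The abstract does not specify how the DQN prior enters the search, so the PUCT-style scoring rule, the softmax over Q-values, and the "repeat one primitive action" form of a macro action are all assumptions made for illustration. Every name here (`softmax`, `expand_macro_action`, `puct_select`, `c_puct`) is hypothetical and not taken from the paper.

```python
import math


def softmax(values, temperature=1.0):
    """Turn raw scores (assumed here to be Q-values) into a probability vector."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def expand_macro_action(primitive, repeat):
    """Assumed macro-action form: hold one primitive action for `repeat` steps."""
    return [primitive] * repeat


def puct_select(q_prior, visit_counts, value_sums, c_puct=1.5):
    """Pick the child index with the highest PUCT-style score.

    q_prior      : Q-value estimates per macro action (stand-in for a DQN output)
    visit_counts : number of visits to each child so far
    value_sums   : sum of returns backed up through each child
    """
    priors = softmax(q_prior)
    total_visits = sum(visit_counts)
    best_index, best_score = 0, -math.inf
    for i, prior in enumerate(priors):
        # Mean backed-up value; unvisited children default to 0.
        mean_value = value_sums[i] / visit_counts[i] if visit_counts[i] > 0 else 0.0
        # Exploration bonus shrinks as a child accumulates visits.
        bonus = c_puct * prior * math.sqrt(total_visits + 1) / (1 + visit_counts[i])
        score = mean_value + bonus
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

Under these assumptions, an unexplored node expands the macro action that the prior favors first, while accumulated search statistics can later override the prior. Searching over macro actions shortens the tree because each edge covers several primitive time steps.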
