Decision-making for urban autonomous driving is challenging due to the stochastic behavior of interacting traffic participants and the complexity of road structures. Although reinforcement learning (RL)-based decision-making schemes are promising for handling urban driving scenarios, they suffer from low sample efficiency and poor adaptability. In this paper, we propose the Scene-Rep Transformer to improve RL decision-making capabilities through better scene representation encoding and sequential predictive latent distillation. Specifically, a multi-stage Transformer (MST) encoder is constructed to model not only the interaction awareness between the ego vehicle and its neighbors but also the intention awareness between the agents and their candidate routes. A sequential latent Transformer (SLT) with self-supervised learning objectives is employed to distill future predictive information into the latent scene representation, in order to reduce the exploration space and speed up training. The final decision-making module, based on soft actor-critic (SAC), takes as input the refined latent scene representation from the Scene-Rep Transformer and outputs driving actions. The framework is validated in five challenging simulated urban scenarios with dense traffic, and it shows substantial quantitative improvements in data efficiency and in driving performance, measured by success rate, safety, and efficiency. The qualitative results reveal that our framework is able to extract the intentions of neighbor agents to help make decisions and deliver more diversified driving behaviors.
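The abstract describes the interaction-awareness stage of the MST encoder only at a high level. As a rough illustration of the scaled dot-product attention that Transformer encoders are generally built on, the sketch below lets a single ego-vehicle query attend over a few neighbor feature vectors. It is not the authors' implementation: the function names `softmax` and `attend`, the three-element feature layout, and the toy values are all assumptions made for illustration, and the learned query/key/value projections, multiple heads, and route-level stage of a real encoder are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.

    Returns the attention-weighted sum of `values` and the weights.
    No learned projections are applied (an illustrative simplification).
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    context = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(dim)
    ]
    return context, weights


# Hypothetical toy features: [relative x, relative y, speed].
ego = [0.0, 0.0, 1.0]
neighbors = [
    [1.0, 0.0, 1.0],
    [0.0, 5.0, 0.2],
    [-3.0, 0.0, 0.5],
]

# The ego vehicle queries its neighbors; keys and values share the same features.
context, weights = attend(ego, neighbors, neighbors)
```

Here `weights` indicates how strongly the ego query attends to each neighbor, and `context` is the resulting interaction-aware summary vector. In a full encoder this vector would be produced from learned projections and passed on to later stages rather than computed from raw features.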