End-to-end autonomous driving has witnessed remarkable progress. However, the extensive deployment of autonomous vehicles has yet to be realized, primarily due to 1) inefficient multi-modal environment perception: how to integrate data from multi-modal sensors more efficiently; and 2) non-human-like scene understanding: how to effectively locate and predict critical risky agents in traffic scenarios as an experienced driver does. To overcome these challenges, in this paper, we propose M2DA, a Multi-Modal fusion transformer incorporating Driver Attention for autonomous driving. To better fuse multi-modal data and achieve higher alignment between different modalities, we propose a novel Lidar-Vision-Attention-based Fusion (LVAFusion) module. By incorporating driver attention, we endow autonomous vehicles with human-like scene understanding, enabling them to precisely identify crucial areas within complex scenarios and ensure safety. We conduct experiments on the CARLA simulator and achieve state-of-the-art performance with less data in closed-loop benchmarks. Source code is available at https://anonymous.4open.science/r/M2DA-4772.