Feature misalignment in bird's-eye-view (BEV) perception is a critical yet often overlooked challenge in autonomous driving, especially when the calibration between LiDAR and camera sensors is uncertain. To address this issue, we propose GraphBEV++, a robust multi-modal fusion framework that systematically mitigates projection-induced misalignment. The framework consists of two key modules: LocalAlign-v2 and GlobalAlign-v2. LocalAlign-v2 introduces neighborhood-aware depth features via graph matching to correct local misalignment. It supports both LSS-based and query-based BEV representations, making it compatible with the BEVFusion and BEVFormer architectures and enabling consistent cross-paradigm alignment. GlobalAlign-v2 comes in two variants: Deformable and Diffusion. The Deformable variant addresses global misalignment in LSS-based multi-modal BEV by explicitly learning cross-modal feature offsets. In contrast, the Diffusion variant targets implicit misalignment in query-based BEV by injecting noise to simulate misalignment and employing a denoising process to recover aligned features. Experimental results show that GraphBEV++ achieves state-of-the-art performance under misalignment noise on nuScenes and a Waymo subset, improves long-range detection on Argoverse2, and generalizes effectively to the 3D occupancy prediction task, consistently improving occupancy estimation accuracy and robustness under both clean and noisy settings. Furthermore, GraphBEV++ effectively alleviates misalignment issues in end-to-end autonomous driving. Compared with five baselines (UniAD, VAD, FusionAD, MomAD, and WoTE), it demonstrates superior performance in both open-loop (nuScenes) and closed-loop (Bench2Drive and NAVSIM) evaluations across perception, prediction, and planning tasks.
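To make the notion of projection-induced misalignment concrete, the following is a minimal NumPy sketch, not the GraphBEV++ implementation. It perturbs a hypothetical LiDAR-to-camera extrinsic by a small rotation and translation, measures how far the projected points move in the image, and then gathers the depths of each point's nearest projected neighbors as one simple reading of "neighborhood-aware depth". The intrinsics `K`, the noise magnitudes, the synthetic point cloud, and the helper names `rotation_y`, `project`, and `neighbor_depths` are all assumptions introduced for illustration. The abstract does not specify how neighbors are selected, so the k-nearest-neighbor search in pixel space is likewise an assumption, and for simplicity the LiDAR frame is taken to be axis-aligned with the camera frame.

```python
import numpy as np

def rotation_y(angle_rad):
    """Rotation about the camera's vertical (y) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(points, R, t, K):
    """Map Nx3 points through extrinsics (R, t) and pinhole intrinsics K.

    Returns pixel coordinates of shape (N, 2) and depths of shape (N,).
    """
    cam = points @ R.T + t           # points in the camera frame
    depth = cam[:, 2]                # distance along the optical axis
    pix = cam @ K.T                  # homogeneous pixel coordinates
    return pix[:, :2] / depth[:, None], depth

def neighbor_depths(uv, depth, k):
    """Depths of the k nearest projected points in pixel space.

    The nearest neighbor of each point is the point itself (distance 0).
    """
    dist = np.linalg.norm(uv[:, None, :] - uv[None, :, :], axis=-1)
    idx = np.argsort(dist, axis=1)[:, :k]
    return depth[idx]

# Hypothetical intrinsics and a synthetic point cloud in front of the camera.
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])
rng = np.random.default_rng(0)
points = np.column_stack([
    rng.uniform(-10, 10, 200),   # lateral offset (m)
    rng.uniform(-2, 2, 200),     # vertical offset (m)
    rng.uniform(5, 50, 200),     # distance ahead (m)
])

# Reference calibration versus a perturbed one (assumed noise: 1 degree, 5 cm).
R_true, t_true = np.eye(3), np.zeros(3)
R_noisy = rotation_y(np.deg2rad(1.0)) @ R_true
t_noisy = t_true + np.array([0.05, 0.0, 0.0])

uv_true, depth_true = project(points, R_true, t_true, K)
uv_noisy, _ = project(points, R_noisy, t_noisy, K)

# Per-point displacement in the image caused by the calibration error.
pixel_shift = np.linalg.norm(uv_noisy - uv_true, axis=1)
print(f"mean pixel shift: {pixel_shift.mean():.1f} px")

# Each row holds the depths of a point and its 7 nearest projected neighbors.
local_depths = neighbor_depths(uv_true, depth_true, k=8)
```

Under these assumptions, even a one-degree rotation error moves projections by more than ten pixels at this focal length, which is why a single projected depth per pixel can land on the wrong image feature. Collecting neighboring depths gives a downstream module several candidates to choose from instead of one possibly misplaced value.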