GraphBEV++: Multi-Modal Feature Alignment for Autonomous Driving

Feature misalignment in bird's-eye-view (BEV) perception is a critical yet often overlooked challenge in autonomous driving, especially when the calibration between LiDAR and camera sensors is uncertain. To address this issue, we propose GraphBEV++, a robust multi-modal fusion framework that systematically mitigates projection-induced misalignment. The framework consists of two key modules: LocalAlign-v2 and GlobalAlign-v2. LocalAlign-v2 introduces neighborhood-aware depth features via graph matching to correct local misalignment. It supports both LSS-based and query-based BEV representations, making it compatible with the BEVFusion and BEVFormer architectures for consistent cross-paradigm alignment. GlobalAlign-v2 has two variants: Deformable and Diffusion. The Deformable variant addresses global misalignment in LSS-based multi-modal BEV by explicitly learning cross-modal feature offsets. In contrast, the Diffusion variant targets implicit misalignment in query-based BEV by injecting noise to simulate misalignment and employing a denoising process to recover aligned features. Experimental results show that GraphBEV++ achieves state-of-the-art performance under misalignment noise on nuScenes and a Waymo subset, improves long-range detection on Argoverse2, and generalizes effectively to the 3D occupancy prediction task, consistently improving occupancy estimation accuracy and robustness under both clean and noisy settings. Furthermore, GraphBEV++ effectively alleviates misalignment issues in end-to-end autonomous driving. Compared with five baselines (UniAD, VAD, FusionAD, MomAD, and WoTE), it demonstrates superior performance in both open-loop (nuScenes) and closed-loop (Bench2Drive and NAVSIM) evaluations across perception, prediction, and planning tasks.
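To make the idea of neighborhood-aware depth features concrete, the following is a minimal NumPy sketch rather than the GraphBEV++ implementation. It assumes LiDAR points have already been projected to pixel coordinates, and it uses a plain k-nearest-neighbor lookup in place of the graph matching used by LocalAlign-v2. The function name `neighbor_depths`, the toy coordinates, and the 0.4-pixel shift are all hypothetical, chosen only for illustration.

```python
import numpy as np


def neighbor_depths(query_uv, lidar_uv, lidar_depth, k=2):
    """Gather the depths of the k projected LiDAR points nearest to each query pixel.

    query_uv:    (Q, 2) pixel coordinates to look up (possibly misaligned).
    lidar_uv:    (N, 2) pixel coordinates of projected LiDAR points.
    lidar_depth: (N,)   depth of each projected LiDAR point.
    Returns a (Q, k) array of candidate depths, nearest neighbor first.
    """
    diff = query_uv[:, None, :] - lidar_uv[None, :, :]       # (Q, N, 2)
    dist2 = (diff ** 2).sum(axis=-1)                         # (Q, N) squared pixel distances
    idx = np.argsort(dist2, axis=1, kind="stable")[:, :k]    # (Q, k) nearest indices
    return lidar_depth[idx]


# Toy example: two clusters of projected points, near and far.
lidar_uv = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
lidar_depth = np.array([5.0, 6.0, 20.0, 21.0])

# Assumed calibration error: every lookup lands 0.4 px to the right.
shifted_uv = lidar_uv + np.array([0.4, 0.0])

candidates = neighbor_depths(shifted_uv, lidar_uv, lidar_depth, k=2)
```

Each row of `candidates` holds a small set of plausible depths around the shifted pixel instead of a single value, which is the kind of local context a learned module could then weigh when correcting misalignment.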
