Multi-view radar-camera fused 3D object detection offers a longer detection range and more useful features for autonomous driving, especially in adverse weather. Current radar-camera fusion methods offer a variety of designs for fusing radar information with camera data. However, these approaches usually rely on straightforward concatenation of multi-modal features, which neglects the semantic alignment of radar features and fails to capture sufficient correlations across modalities. In this paper, we present MVFusion, a novel Multi-View radar-camera Fusion method that produces semantic-aligned radar features and enhances cross-modal information interaction. To this end, we inject semantic alignment into the radar features via the semantic-aligned radar encoder (SARE) to produce image-guided radar features. We then propose the radar-guided fusion transformer (RGFT), which fuses the radar and image features and strengthens the correlation between the two modalities at a global scope via the cross-attention mechanism. Extensive experiments show that MVFusion achieves state-of-the-art performance (51.7% NDS and 45.3% mAP) on the nuScenes dataset. We will release our code and trained networks upon publication.
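To make the contrast with plain concatenation concrete, the following is a minimal NumPy sketch of generic single-head cross-attention between two sets of tokens. It is not the RGFT design, which the abstract does not specify. The choice of image tokens as queries and radar tokens as keys and values, the token counts, the feature width `d_model`, the random projection matrices, and the residual addition are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (n_q, d) tokens that ask for information.
    context: (n_c, d) tokens that supply information.
    Returns the attended features (n_q, d) and the weights (n_q, n_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query row sums to 1
    return weights @ v, weights


# Hypothetical toy inputs: 6 image tokens and 4 radar tokens of width 8.
rng = np.random.default_rng(0)
d_model = 8
image_tokens = rng.normal(size=(6, d_model))
radar_tokens = rng.normal(size=(4, d_model))

# Random projections stand in for learned weights (assumption).
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

# Assumed direction: image tokens attend to every radar token.
attended, weights = cross_attention(image_tokens, radar_tokens, w_q, w_k, w_v)

# Residual fusion, one common choice; the paper's fusion step may differ.
fused = image_tokens + attended
```

Unlike concatenation, which pairs features only at matching positions, every query token here receives a weighted combination of all context tokens, which is one way to read fusion "at a global scope".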