Bird's-eye-view (BEV) representation has emerged as a dominant solution for describing 3D space in autonomous driving scenarios. However, objects in the BEV representation are typically small, and the associated point cloud context is inherently sparse, which poses significant challenges for reliable 3D perception. In this paper, we propose IS-Fusion, an innovative multimodal fusion framework that jointly captures Instance- and Scene-level contextual information. Unlike existing approaches that focus only on BEV scene-level fusion, IS-Fusion explicitly incorporates instance-level multimodal information, thereby facilitating instance-centric tasks such as 3D object detection. It comprises a Hierarchical Scene Fusion (HSF) module and an Instance-Guided Fusion (IGF) module. HSF applies Point-to-Grid and Grid-to-Region transformers to capture the multimodal scene context at different granularities. IGF mines instance candidates, explores their relationships, and aggregates the local multimodal context for each instance. These instances then serve as guidance to enhance the scene feature and yield an instance-aware BEV representation. On the challenging nuScenes benchmark, IS-Fusion outperforms all published multimodal works to date. Code is available at: https://github.com/yinjunbo/IS-Fusion.
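To make the HSF-then-IGF data flow concrete, the following is a minimal toy sketch and not the authors' implementation (which is in the linked repository). It makes several loudly stated assumptions: each point carries a single scalar that stands in for an already-fused LiDAR-camera feature, the Point-to-Grid and Grid-to-Region transformers are replaced by plain mean pooling, instance candidates are simply the top-k highest-valued BEV cells, and the instance-relationship step is omitted. All function names and parameters (`point_to_grid`, `grid_to_region`, `cell`, `region`, `k`, `radius`) are hypothetical.

```python
from collections import defaultdict


def point_to_grid(points, cell=1.0):
    """Mean-pool point features into BEV cells (stand-in for Point-to-Grid)."""
    sums = defaultdict(lambda: [0.0, 0])
    for x, y, feat in points:
        key = (int(x // cell), int(y // cell))
        sums[key][0] += feat
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def grid_to_region(grid, region=2):
    """Mean-pool cells into coarser regions (stand-in for Grid-to-Region)."""
    sums = defaultdict(lambda: [0.0, 0])
    for (ix, iy), feat in grid.items():
        key = (ix // region, iy // region)
        sums[key][0] += feat
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def hierarchical_scene_fusion(points, cell=1.0, region=2):
    """Combine each cell's own feature with the context of its region."""
    grid = point_to_grid(points, cell)
    regions = grid_to_region(grid, region)
    return {
        (ix, iy): feat + regions[(ix // region, iy // region)]
        for (ix, iy), feat in grid.items()
    }


def instance_guided_fusion(scene, k=1, radius=1):
    """Pick top-k cells as instance candidates, aggregate their local
    context, and add it back to the scene feature at those cells."""
    candidates = sorted(scene, key=lambda c: scene[c], reverse=True)[:k]
    enhanced = dict(scene)
    for cx, cy in candidates:
        local = [
            feat
            for (x, y), feat in scene.items()
            if abs(x - cx) <= radius and abs(y - cy) <= radius
        ]
        enhanced[(cx, cy)] += sum(local) / len(local)
    return enhanced, candidates


if __name__ == "__main__":
    # (x, y, fused_feature) triples; the values are arbitrary toy data.
    pts = [(0.2, 0.3, 1.0), (0.7, 0.1, 3.0), (2.5, 0.5, 4.0), (5.5, 5.5, 2.0)]
    scene = hierarchical_scene_fusion(pts)
    enhanced, candidates = instance_guided_fusion(scene, k=1, radius=1)
    print(candidates, enhanced)
```

The sketch only mirrors the ordering described in the abstract: scene context is built at two granularities first, and instance-level context is then written back into the BEV map so that the final representation is instance-aware.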