3D single object tracking plays a crucial role in computer vision. Mainstream methods rely mainly on point clouds to perform geometry matching between the target template and the search area. However, textureless and incomplete point clouds make it difficult for single-modal trackers to distinguish objects with similar structures. To overcome the limitations of geometry matching, we propose a Multi-modal Multi-level Fusion Tracker (MMF-Track), which exploits both image texture and the geometric characteristics of point clouds to track 3D targets. Specifically, we first propose a Space Alignment Module (SAM) to align RGB images with point clouds in 3D space, which is a prerequisite for constructing inter-modal associations. Then, at the feature interaction level, we design a Feature Interaction Module (FIM) based on a dual-stream structure, which enhances intra-modal features in parallel and constructs inter-modal semantic associations. Meanwhile, to refine the features of each modality, we introduce a Coarse-to-Fine Interaction Module (CFIM) that performs hierarchical feature interaction at different scales. Finally, at the similarity fusion level, we propose a Similarity Fusion Module (SFM) to aggregate geometry and texture cues from the target. Experiments show that our method achieves state-of-the-art performance on KITTI (gains of 39% in Success and 42% in Precision over the previous multi-modal method) and is also competitive on NuScenes.
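The abstract does not specify how SAM aligns the two modalities, so the following is only a generic illustration of one common way to associate image pixels with LiDAR points: a pinhole projection of each 3D point into the image, followed by a color lookup. The function names (`project_point`, `colorize_points`) and the 3x4 projection matrix `P` are hypothetical and are not taken from MMF-Track.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Color = Tuple[int, int, int]


def project_point(P: Sequence[Sequence[float]], pt: Point) -> Optional[Tuple[float, float]]:
    """Project a 3D point with a 3x4 matrix P (assumed: intrinsics @ extrinsics).

    Returns pixel coordinates (u, v), or None if the point lies behind the camera.
    """
    x, y, z = pt
    # Homogeneous multiplication: one output value per row of P.
    u, v, w = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in P)
    if w <= 0.0:
        return None
    return u / w, v / w


def colorize_points(
    points: Sequence[Point],
    image: Sequence[Sequence[Color]],
    P: Sequence[Sequence[float]],
) -> List[Tuple[Point, Color]]:
    """Pair each visible 3D point with the RGB value of the pixel it projects to.

    Points behind the camera or outside the image bounds are dropped.
    `image` is indexed as image[row][col].
    """
    height, width = len(image), len(image[0])
    paired = []
    for pt in points:
        uv = project_point(P, pt)
        if uv is None:
            continue
        col, row = math.floor(uv[0]), math.floor(uv[1])
        if 0 <= col < width and 0 <= row < height:
            paired.append((pt, image[row][col]))
    return paired
```

Under these assumptions, each surviving point carries both its 3D coordinates and a texture sample, which is the kind of pixel-to-point correspondence that later inter-modal feature interaction would build on. A real pipeline would take `P` from the dataset's calibration files and would typically sample learned image features rather than raw colors.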