Vision Language Place Recognition (VLVPR) enhances robot localization performance by incorporating natural language descriptions derived from images. By exploiting language information, VLVPR guides robot place matching, overcoming the limitation of relying on vision alone. The essence of multimodal fusion lies in mining the complementary information between different modalities. However, general fusion methods rely on traditional neural architectures and are poorly equipped to capture the dynamics of cross-modal interactions, especially in the presence of complex intra-modal and inter-modal correlations. To this end, this paper proposes MambaPlace, a novel coarse-to-fine, end-to-end cross-modal place recognition framework. In the coarse localization stage, the text description and the 3D point cloud are encoded by a pretrained T5 model and an instance encoder, respectively, and are then processed by Text Attention Mamba (TAM) and Point Clouds Mamba (PCM) for feature enhancement and alignment. In the subsequent fine localization stage, the text and point cloud features are cross-modally fused and further refined through cascaded Cross Attention Mamba (CCAM). Finally, the positional offset is predicted from the fused text-point cloud features, yielding the most accurate localization. Extensive experiments show that MambaPlace achieves higher localization accuracy on the KITTI360Pose dataset than state-of-the-art methods.
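The building block shared by TAM, PCM, and CCAM is the Mamba selective state-space scan, in which the recurrence gates depend on the input itself rather than being fixed. A minimal pure-Python sketch of such a 1-D selective scan is shown below; it is illustrative only, and all function names, parameters, and shapes are hypothetical rather than taken from the paper's implementation.

```python
import math

def selective_scan(xs, w_a, w_b, w_c):
    """Toy 1-D selective state-space scan (Mamba-style sketch).

    Each step's decay a_t and write gate b_t depend on the current
    input x_t ("selectivity"), unlike a fixed linear RNN. This is a
    scalar illustration, not the authors' multi-channel implementation.
    """
    h = 0.0
    ys = []
    for x in xs:
        a = 1.0 / (1.0 + math.exp(-(w_a * x)))  # input-dependent decay in (0, 1)
        b = math.tanh(w_b * x)                   # input-dependent write gate
        h = a * h + b * x                        # recurrent state update
        ys.append(w_c * h)                       # linear readout
    return ys
```

In the cascaded cross-attention setting, the sequence fed into such a scan would interleave or condition on features from the other modality, so the input-dependent gates let text tokens modulate how point cloud features accumulate in the state, and vice versa.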