Cross-modal localization using text and point clouds enables robots to localize themselves from natural language descriptions, with applications in autonomous navigation and human-robot interaction. In this task, the same objects often recur across text and point clouds, making spatial relationships the most discriminative cues for localization. Motivated by this, we present SpatiaLoc, a coarse-to-fine framework that emphasizes spatial relationships at both the instance and global levels. In the coarse stage, a Bezier Enhanced Object Spatial Encoder (BEOSE) models instance-level spatial relationships with quadratic Bezier curves, while a Frequency Aware Encoder (FAE) produces global-level spatial representations in the frequency domain. In the fine stage, an Uncertainty Aware Gaussian Fine Localizer (UGFL) regresses 2D positions by modeling predictions as Gaussian distributions trained with an uncertainty-aware loss. Extensive experiments on KITTI360Pose demonstrate that SpatiaLoc significantly outperforms state-of-the-art (SOTA) methods.
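To make the BEOSE idea concrete, the sketch below shows the standard quadratic Bezier parameterization that such an encoder could sample when representing a pairwise spatial relationship between two object centers. This is a minimal illustration, not the paper's implementation: the perpendicular offset of the control point and the sample count are assumptions.

```python
import numpy as np

def quadratic_bezier(p0, p1, p2, n=8):
    """Sample n points on the quadratic Bezier curve with endpoints
    p0, p2 and control point p1:
        B(t) = (1-t)^2 p0 + 2(1-t)t p1 + t^2 p2,  t in [0, 1]."""
    t = np.linspace(0.0, 1.0, n)[:, None]  # (n, 1) parameter values
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

# Hypothetical example: connect two object centers with a curve bent
# through a control point offset perpendicular to the chord, so the
# sampled points carry more than the straight-line direction.
a = np.array([0.0, 0.0])                      # center of object A
b = np.array([4.0, 0.0])                      # center of object B
chord = b - a
normal = np.array([-chord[1], chord[0]]) / np.linalg.norm(chord)
ctrl = (a + b) / 2 + 1.0 * normal             # assumed bend magnitude
samples = quadratic_bezier(a, ctrl, b)        # (8, 2) curve samples
```

The sampled points could then be fed to a point-wise encoder; the curve always interpolates both endpoints, so the relationship's anchor objects are preserved exactly.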
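For the UGFL stage, modeling a 2D position prediction as a Gaussian and training with an uncertainty-aware loss typically reduces to a Gaussian negative log-likelihood over a predicted mean and log-variance. The following is a minimal sketch of that loss under an isotropic-Gaussian assumption; the function name and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def gaussian_nll(pred_mu, pred_log_var, target):
    """Uncertainty-aware regression loss: mean negative log-likelihood
    of target under an isotropic Gaussian N(pred_mu, exp(pred_log_var) I).
    Predicting log-variance lets the model down-weight hard samples,
    while the d * log_var term penalizes claiming high uncertainty
    everywhere. Constant terms are dropped.
    pred_mu: (batch, d), pred_log_var: (batch,), target: (batch, d)."""
    d = pred_mu.shape[-1]
    var = np.exp(pred_log_var)                       # (batch,)
    sq_err = np.sum((target - pred_mu) ** 2, axis=-1)  # (batch,)
    return np.mean(0.5 * (sq_err / var + d * pred_log_var))

# With unit variance (log_var = 0) this reduces to half the squared error.
loss = gaussian_nll(np.array([[0.0, 0.0]]),
                    np.array([0.0]),
                    np.array([[1.0, 1.0]]))  # 0.5 * (1^2 + 1^2) = 1.0
```

In practice the log-variance head is learned jointly with the mean head, so confident predictions contribute sharper gradients than uncertain ones.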