We tackle the problem of 3D point cloud localization from a few natural language descriptions and introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition, followed by fine localization. In global place recognition, the relational dynamics among the textual hints are captured by a hierarchical transformer with max-pooling (HTM), while text-submap contrastive learning maintains a balance between positive and negative pairs. Moreover, we propose a novel matching-free fine localization method to further refine the location predictions; it completely removes the need for complicated text-instance matching and is lighter, faster, and more accurate than previous methods. Extensive experiments show that Text2Loc improves localization accuracy by up to $2\times$ over the state of the art on the KITTI360Pose dataset. We will make the code publicly available.
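To make the coarse stage concrete, the following is a minimal sketch of text-submap contrastive learning and similarity-based submap retrieval. It is not the Text2Loc implementation: the abstract does not specify the loss, so a symmetric cross-entropy over temperature-scaled cosine similarities is assumed here, and all function names, the `temperature` value, and the toy embeddings are hypothetical. In practice the embeddings would come from the text and point cloud encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(text_embs, submap_embs):
    """Cosine similarity between every text embedding and every submap embedding."""
    t = [l2_normalize(v) for v in text_embs]
    s = [l2_normalize(v) for v in submap_embs]
    return [[sum(a * b for a, b in zip(ti, sj)) for sj in s] for ti in t]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(text_embs, submap_embs, temperature=0.1):
    """Assumed symmetric loss: text-to-submap plus submap-to-text, averaged.

    Pair i (text i, submap i) is the positive; all other pairs in the
    batch act as negatives.
    """
    sim = similarity_matrix(text_embs, submap_embs)
    logits = [[x / temperature for x in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


def retrieve_top_k(query_emb, submap_embs, k=3):
    """Coarse stage at inference: indices of the k most similar submaps."""
    sims = similarity_matrix([query_emb], submap_embs)[0]
    ranked = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
    return ranked[:k]


# Toy 2D embeddings, purely illustrative.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_submaps = [[1.0, 0.0], [0.0, 1.0]]
swapped_submaps = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = contrastive_loss(texts, aligned_submaps)
loss_swapped = contrastive_loss(texts, swapped_submaps)
top2 = retrieve_top_k([1.0, 0.1], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2)
```

With these toy inputs, the loss is small when each description is closest to its own submap and large when the pairing is swapped. The retrieved candidates would then be passed to the fine localization stage.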