Natural language provides an intuitive way to express spatial intent in geospatial applications. Existing localization methods often rely on dense point cloud maps or high-resolution imagery, whereas OpenStreetMap (OSM) offers a compact, freely available map representation that encodes rich semantic and structural information, making it well suited to large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O localization task, which aims to estimate accurate 2D positions in urban environments from textual scene descriptions, without relying on geometric observations or a GNSS-based initial location. To support this task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both the textual descriptions and the OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and the top-1 retrieved tile are processed jointly: a dedicated alignment module fuses the textual descriptor with local map features to regress the 2-DoF pose. Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53%, 9.93%, and 8.32% at the 5 m, 10 m, and 25 m thresholds, respectively, and generalizes well to unseen environments. The dataset, code, and models will be made publicly available at: https://github.com/WHU-USI3DV/TOL.
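
To make the coarse-to-fine pipeline and the threshold-based evaluation concrete, the following is a minimal sketch and not the TOLoc implementation. Several things are assumed purely for illustration: descriptors are plain lists of floats, cosine similarity is the retrieval metric, the fine stage outputs a 2D offset relative to the retrieved tile's center, and `predict_offset` is a hypothetical stand-in for the learned fine-stage regressor. All function names are likewise hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_top_k(query_desc, tile_descs, k=1):
    """Coarse stage (assumed form): rank tiles by descriptor similarity."""
    scores = [cosine(query_desc, d) for d in tile_descs]
    order = sorted(range(len(tile_descs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def localize(query_desc, tile_descs, tile_centers, predict_offset):
    """Fine stage (assumed form): refine the top-1 tile center by an offset.

    predict_offset is a hypothetical callable standing in for a learned
    regressor; it returns (dx, dy) in meters for the retrieved tile.
    """
    best = retrieve_top_k(query_desc, tile_descs, k=1)[0]
    cx, cy = tile_centers[best]
    dx, dy = predict_offset(query_desc, best)
    return (cx + dx, cy + dy)


def recall_at(preds, gts, threshold_m):
    """Fraction of queries whose 2D position error is within threshold_m."""
    hits = sum(1 for p, g in zip(preds, gts) if math.dist(p, g) <= threshold_m)
    return hits / len(preds)
```

Under these assumptions, the figures reported at 5 m, 10 m, and 25 m would correspond to calling `recall_at` with each threshold on the same set of predictions.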