Accurate visual localization is crucial for autonomous driving, yet existing methods face a fundamental dilemma: while high-definition (HD) maps provide high-precision localization references, their costly construction and maintenance hinder scalability, driving research toward standard-definition (SD) maps such as OpenStreetMap. Current SD-map-based approaches focus primarily on Bird's-Eye View (BEV) matching between images and maps, overlooking a ubiquitous signal: noisy GPS. Although GPS is readily available, it suffers from multipath errors in urban environments. We propose DiffVL, the first framework to reformulate visual localization as a GPS denoising task using diffusion models. Our key insight is that noisy GPS trajectories, when conditioned on visual BEV features and SD maps, implicitly encode the true pose distribution, which can be recovered through iterative diffusion refinement. Unlike prior BEV-matching methods (e.g., OrienterNet) or transformer-based registration approaches, DiffVL learns to reverse GPS noise perturbations by jointly modeling GPS, SD-map, and visual signals, achieving sub-meter accuracy without relying on HD maps. Experiments on multiple datasets show that our method achieves state-of-the-art accuracy compared to BEV-matching baselines. Crucially, our work demonstrates that diffusion models can enable scalable localization by treating noisy GPS as a generative prior, marking a paradigm shift away from traditional matching-based methods.
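To illustrate the "GPS denoising via iterative diffusion refinement" idea at a high level, the sketch below shows a standard DDPM-style reverse process applied to a 2D pose, where the raw GPS fix plays the role of the noisy sample and a denoiser network (here a toy stand-in) would be conditioned on BEV and SD-map features. All names (`reverse_diffusion`, `toy_denoiser`, the schedule values) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def make_beta_schedule(T=50, beta_start=1e-4, beta_end=0.05):
    """Linear variance schedule for the diffusion process (assumed form)."""
    return np.linspace(beta_start, beta_end, T)

def reverse_diffusion(noisy_pose, denoiser, cond, T=50, seed=0):
    """DDPM-style iterative refinement of a noisy GPS pose.

    noisy_pose : (2,) array, the raw GPS fix treated as x_T.
    denoiser   : callable(x_t, t, cond) -> predicted noise eps; in the
                 paper's setting this would be a network conditioned on
                 visual BEV features and the SD map.
    cond       : conditioning features (placeholder here).
    """
    rng = np.random.default_rng(seed)
    betas = make_beta_schedule(T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = noisy_pose.copy()
    for t in reversed(range(T)):
        eps = denoiser(x, t, cond)                  # predicted GPS noise
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])   # posterior mean step
        if t > 0:                                   # no noise at the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

# Toy denoiser standing in for the learned network: it nudges the pose
# toward a conditioning "map-consistent" location (hypothetical).
def toy_denoiser(x, t, cond):
    return (x - cond) * 0.1

refined = reverse_diffusion(np.array([5.0, -3.0]), toy_denoiser,
                            cond=np.array([0.0, 0.0]))
```

The key design point mirrored here is that the noisy GPS fix is not matched against the map directly; it is treated as a sample from the forward noising process, and localization becomes running the learned reverse process under visual and map conditioning.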