Visual geolocalization, the task of predicting where an image was taken, remains challenging due to global scale, visual ambiguity, and the inherently hierarchical structure of geography. Existing paradigms rely on large-scale retrieval, which requires storing a large number of image embeddings; grid-based classifiers, which ignore geographic continuity; or generative models, which diffuse over space but struggle with fine detail. We introduce an entity-centric formulation of geolocation that replaces image-to-image retrieval with a compact hierarchy of geographic entities embedded in hyperbolic space. Images are aligned directly to country, region, subregion, and city entities through Geo-Weighted Hyperbolic contrastive learning, which incorporates haversine distance directly into the contrastive objective. This hierarchical design enables interpretable predictions and efficient inference with 240k entity embeddings instead of over 5 million image embeddings on the OSV5M benchmark, where our method establishes a new state of the art. Compared with current methods in the literature, it reduces mean geodesic error by 19.5% and improves fine-grained subregion accuracy by 43%. These results demonstrate that geometry-aware hierarchical embeddings provide a scalable and conceptually new alternative for global image geolocation.
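The two geometric quantities named above can be illustrated with a minimal Python sketch. The haversine formula and the Poincaré-ball distance are standard; everything else is an assumption made for illustration. In particular, the abstract does not say which model of hyperbolic space is used or how haversine distance enters the loss, so the choice of the Poincaré ball and the `geo_weight` function (with its `scale_km` parameter) are hypothetical and should not be read as the method's actual objective.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    Assumption: the Poincare ball is used here only as one common model of
    hyperbolic space; the abstract does not specify the model.
    """
    sq_norm = lambda x: sum(t * t for t in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


def geo_weight(dist_km, scale_km=1000.0):
    """Hypothetical weight in [0, 1) that grows with geographic distance.

    Illustrates one way a contrastive loss could penalise geographically
    distant negatives more than nearby ones; not taken from the paper.
    """
    return 1.0 - math.exp(-dist_km / scale_km)


if __name__ == "__main__":
    d = haversine_km(48.8566, 2.3522, 51.5074, -0.1278)  # Paris to London
    print(f"haversine: {d:.1f} km, weight: {geo_weight(d):.3f}")
    print(f"hyperbolic: {poincare_distance([0.0, 0.0], [0.5, 0.0]):.4f}")
```

In a sketch like this, an image embedding would be compared to entity embeddings with `poincare_distance`, while `haversine_km` between the image's true location and each entity's location would modulate how strongly each negative entity is pushed away.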