Worldwide geolocalization aims to predict the coordinate-level location of photos taken anywhere on Earth. The task is challenging because 1) subtle location-aware visual semantics are difficult to capture, and 2) image data are heterogeneously distributed across geographic regions. As a result, existing methods show clear limitations at worldwide scale: they may confuse distant images that share similar visual content, or fail to adapt to locations with differing amounts of relevant data. To address these limitations, we propose G3, a novel framework based on Retrieval-Augmented Generation (RAG). G3 consists of three steps, Geo-alignment, Geo-diversification, and Geo-verification, which together optimize both the retrieval and generation phases of worldwide geolocalization. During Geo-alignment, our solution jointly learns expressive multi-modal representations of images, GPS coordinates, and textual descriptions, which allows it to capture location-aware semantics and retrieve images near a given query. During Geo-diversification, we use a prompt ensembling method that is robust to inconsistent retrieval performance across image queries. Finally, Geo-verification combines the retrieved and generated GPS candidates to produce the location prediction. Experiments on two well-established datasets, IM2GPS3k and YFCC4k, verify the superiority of G3 over other state-of-the-art methods.
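To make the final step concrete, the following is a minimal sketch of how retrieved and generated GPS candidates might be pooled and ranked during Geo-verification. The abstract does not specify the scoring rule, so everything here is an assumption for illustration: the function names `verify_candidates`, `gps_to_unit_vector`, and `make_toy_score` are hypothetical, and the toy score (a dot product between a query vector and a point on the unit sphere) merely stands in for whatever learned image-GPS similarity the framework uses.

```python
import math
from typing import Callable, Sequence, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def verify_candidates(
    retrieved: Sequence[Coord],
    generated: Sequence[Coord],
    score: Callable[[Coord], float],
) -> Coord:
    """Pool retrieved and generated GPS candidates; return the best-scoring one."""
    # Deduplicate while preserving order (dict keys keep insertion order).
    pool = list(dict.fromkeys([*retrieved, *generated]))
    if not pool:
        raise ValueError("no GPS candidates to verify")
    return max(pool, key=score)


def gps_to_unit_vector(coord: Coord) -> Tuple[float, float, float]:
    """Map (lat, lon) in degrees to a point on the unit sphere."""
    lat, lon = math.radians(coord[0]), math.radians(coord[1])
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def make_toy_score(query_vec: Sequence[float]) -> Callable[[Coord], float]:
    """Hypothetical stand-in for a learned image-GPS similarity (assumption)."""
    def score(coord: Coord) -> float:
        return sum(q * g for q, g in zip(query_vec, gps_to_unit_vector(coord)))
    return score


if __name__ == "__main__":
    # Pretend the query image embeds near central Paris (illustrative only).
    query_vec = gps_to_unit_vector((48.8566, 2.3522))
    retrieved = [(40.7128, -74.0060), (51.5074, -0.1278)]  # from the image database
    generated = [(48.85, 2.35), (35.68, 139.69)]           # from the generative model
    print(verify_candidates(retrieved, generated, make_toy_score(query_vec)))
```

Passing the scoring function as a parameter keeps the selection logic independent of how the similarity is computed, so a learned image-GPS similarity could replace the toy score without changing `verify_candidates`.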