Geo-localization aims to infer the geographic origin of a given signal. In computer vision, geo-localization has served as a demanding benchmark for compositional reasoning and is relevant to public safety. In contrast, progress on audio geo-localization has been constrained by the lack of high-quality audio-location pairs. To address this gap, we introduce AGL1K, the first audio geo-localization benchmark for audio language models (ALMs), spanning 72 countries and territories. To extract reliably localizable samples from a crowd-sourced platform, we propose the Audio Localizability metric, which quantifies the informativeness of each recording, and use it to curate 1,444 audio clips. Evaluations on 16 ALMs show that audio geo-localization capability has emerged in these models. We find that closed-source models substantially outperform open-source models, and that linguistic clues often dominate, serving as a scaffold for prediction. We further analyze ALMs' reasoning traces, regional bias, error causes, and the interpretability of the localizability metric. Overall, AGL1K establishes a benchmark for audio geo-localization and may help advance the geospatial reasoning capability of ALMs.