Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery

Recent image-to-audio models have shown impressive performance on object-centric visual scenes. However, their application to satellite imagery remains limited by the complex, wide-area semantic ambiguity of top-down views. While satellite imagery provides a uniquely scalable source for global soundscape generation, matching these views to real acoustic environments with unique spatial structures is inherently difficult. To address this challenge, we introduce Geo2Sound, a novel task and framework for generating geographically realistic soundscapes from satellite imagery. Specifically, Geo2Sound combines structural geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment in a unified framework. A lightweight classifier summarizes each overhead scene into compact geographic attributes; multiple sound-oriented semantic hypotheses are used to generate diverse, acoustically plausible candidates; and a geo-acoustic alignment module projects the geographic attributes into the acoustic embedding space and selects, from the candidate set, the candidate most consistent with them. Moreover, we establish SatSound-Bench, the first benchmark comprising over 20k high-quality paired satellite images, text descriptions, and real-world audio recordings, collected in the field across more than 10 countries and complemented by three public datasets. Experiments show that Geo2Sound achieves a state-of-the-art Fréchet Audio Distance (FAD) of 1.765, outperforming the strongest baseline by 50.0%. Human evaluations further confirm substantial gains in both realism (26.5%) and semantic alignment, validating high-fidelity synthesis at scale. Project page and source code: https://github.com/Blanketzzz/Geo2Sound
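The candidate-selection step of the geo-acoustic alignment module can be illustrated with a minimal sketch. The abstract does not specify the form of the projection or the consistency measure, so the code below assumes a plain linear projection and cosine similarity; the function names, the attribute vector, and the weight matrix are all hypothetical and are not taken from the Geo2Sound source code.

```python
import math
from typing import List, Sequence, Tuple


def project(attrs: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """Map geographic attributes into the acoustic embedding space.

    Assumption: a linear map, with `weights` given as rows of shape
    (embedding_dim x attribute_dim). The real module may be nonlinear.
    """
    return [sum(w * a for w, a in zip(row, attrs)) for row in weights]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_candidate(
    attrs: Sequence[float],
    weights: Sequence[Sequence[float]],
    candidate_embeddings: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Return the index of the best-scoring candidate and all scores.

    Each candidate embedding stands in for the acoustic embedding of one
    audio clip generated from one semantic hypothesis.
    """
    query = project(attrs, weights)
    scores = [cosine(query, emb) for emb in candidate_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    # Hypothetical attributes, e.g. (water, road, vegetation) fractions.
    attrs = [1.0, 0.0, 0.5]
    # Hypothetical 2 x 3 projection matrix.
    weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
    # Hypothetical 2-D acoustic embeddings of three generated candidates.
    candidates = [[0.0, 1.0], [2.0, 1.0], [-1.0, 0.0]]
    best, scores = select_candidate(attrs, weights, candidates)
    print(best, scores)
```

Cosine similarity is used here only because it is a common choice for comparing embeddings; any other consistency score could be substituted in `select_candidate` without changing the overall structure.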
