Recent audio-to-image models have shown impressive performance in generating images of specific objects conditioned on their corresponding sounds. However, these models fail to reconstruct real-world landscapes conditioned on environmental soundscapes. To address this gap, we present Geo-contextual Soundscape-to-Landscape (GeoS2L) generation, a novel and practically significant task that aims to synthesize geographically realistic landscape images from environmental soundscapes. To support this task, we construct two large-scale geo-contextual multi-modal datasets, SoundingSVI and SonicUrban, which pair diverse environmental soundscapes with real-world landscape images. We propose SounDiT, a diffusion transformer (DiT)-based model that incorporates environmental soundscapes and geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose the Place Similarity Score (PSS), a practically informed geo-contextual evaluation framework that measures consistency between input soundscapes and generated landscape images. Extensive experiments demonstrate that SounDiT outperforms existing baselines on the GeoS2L task, while PSS effectively captures generation consistency at the element, scene, and human-perception levels. Project page: https://gisense.github.io/SounDiT-Page/