While video-to-audio generation has achieved remarkable progress in semantic and temporal alignment, most existing studies focus solely on these aspects, paying limited attention to the spatial perception and immersive quality of the synthesized audio. This limitation stems largely from current models' reliance on mono audio datasets, which lack the binaural spatial information needed to learn visual-to-spatial audio mappings. To address this gap, we introduce two key contributions: we construct BinauralVGGSound, the first large-scale video-binaural audio dataset designed to support spatially aware video-to-audio generation; and we propose an end-to-end spatial audio generation framework guided by visual cues, which explicitly models spatial features. Our framework incorporates a visual-guided audio spatialization module that ensures the generated audio exhibits realistic spatial attributes and layered spatial depth while maintaining semantic and temporal alignment. Experiments show that our approach substantially outperforms state-of-the-art models in spatial fidelity and delivers a more immersive auditory experience without sacrificing temporal or semantic consistency. All datasets, code, and model checkpoints will be publicly released to facilitate future research.