World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains relatively underexplored. Prior work has not established a commonly adopted formulation for audio-visual world modeling under low-level action control or clarified how to jointly capture physically grounded binaural audio and visual dynamics. This work presents a unified formulation of Audio-Visual World Models (AVWM), casting multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations. As a foundational step toward this problem, we construct AVW-4k, a controlled benchmark comprising 30 hours of binaural audio-visual trajectories with action annotations across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration. Extensive experiments on this benchmark demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities. Furthermore, we validate its practical utility in embodied navigation, demonstrating that AVWM improves a vision-language-model-guided agent in continuous audio-visual navigation.