Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that can perceive, generate, and interact in the multimodal real world. In the era of large foundation models, joint modeling of audio and vision has become increasingly crucial, not only for understanding but also for controllable generation and reasoning across dynamic, temporally grounded signals. Recent advances, such as Meta MovieGen and Google Veo-3, highlight the growing industrial and academic focus on unified audio-visual architectures that learn from massive multimodal data. However, despite rapid progress, the literature remains fragmented: it spans diverse tasks, inconsistent taxonomies, and heterogeneous evaluation practices, all of which impede systematic comparison and knowledge integration. This survey provides the first comprehensive review of AVI through the lens of large foundation models. We establish a unified taxonomy covering the broad landscape of AVI tasks, ranging from understanding (e.g., speech recognition, sound localization) to generation (e.g., audio-driven video synthesis, video-to-audio) and interaction (e.g., dialogue, embodied, or agentic interfaces). We synthesize methodological foundations, including modality tokenization, cross-modal fusion, autoregressive and diffusion-based generation, large-scale pretraining, instruction alignment, and preference optimization. Furthermore, we curate representative datasets, benchmarks, and evaluation metrics, offering a structured comparison across task families and identifying open challenges in synchronization, spatial reasoning, controllability, and safety. By consolidating this rapidly expanding field into a coherent framework, this survey aims to serve as a foundational reference for future research on large-scale AVI.