Audio-Visual Intelligence in Large Foundation Models

You Qin, Kai Liu, Shengqiong Wu, Kai Wang, Shijian Deng, Yapeng Tian, Junbin Xiao, Yazhou Xing, Yinghao Ma, Bobo Li, Roger Zimmermann, Lei Cui, Furu Wei, Jiebo Luo, Hao Fei

From arXiv: 56 pages, 16 figures, 24 tables. Repository: https://github.com/JavisVerse/Awesome-AVI

Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that can perceive, generate, and interact in the multimodal real world. In the era of large foundation models, joint modeling of audio and vision has become increasingly crucial, not only for understanding but also for controllable generation and reasoning across dynamic, temporally grounded signals. Recent advances, such as Meta MovieGen and Google Veo-3, highlight the growing industrial and academic focus on unified audio-vision architectures that learn from massive multimodal data. Despite this rapid progress, the literature remains fragmented: it spans diverse tasks, inconsistent taxonomies, and heterogeneous evaluation practices, all of which impede systematic comparison and knowledge integration. This survey provides the first comprehensive review of AVI through the lens of large foundation models. We establish a unified taxonomy covering the broad landscape of AVI tasks, ranging from understanding (e.g., speech recognition, sound localization) to generation (e.g., audio-driven video synthesis, video-to-audio) and interaction (e.g., dialogue, embodied, or agentic interfaces). We synthesize methodological foundations, including modality tokenization, cross-modal fusion, autoregressive and diffusion-based generation, large-scale pretraining, instruction alignment, and preference optimization. Furthermore, we curate representative datasets, benchmarks, and evaluation metrics, offering a structured comparison across task families and identifying open challenges in synchronization, spatial reasoning, controllability, and safety. By consolidating this rapidly expanding field into a coherent framework, this survey aims to serve as a foundational reference for future research on large-scale AVI.

