A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

Kaiwen Luo, Zhenhong Zhou, Leo Wang, Liang Lin, Yang Xiao, Tianyu Shao, Yuanhe Zhang, Yuxuan Li, Miao Yu, Kailin Lyu, Jiaming Zhang, Dongrui Liu, Li Sun, Yueming Wu, Kai Li, Ting Dang, Xiaojun Jia, Rohan Kumar Das, Xinfeng Li, Siyuan Liang, Qiufeng Wang, Xingjun Ma, Jing Chen, Kun Wang, Junhao Dong, Deqing Zou, Yu Cheng, Xia Hu, Zhigang Zeng, Sen Su, Yang Liu, Yu-Gang Jiang, Philip S. Yu, Yew-Soon Ong

The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the capabilities of LALMs have grown far faster than the systematic frameworks needed to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state of the art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The pronounced imbalance between a mature offensive landscape and underdeveloped defenses further underscores the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. Our project is available on GitHub: https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.
