Humans process rich auditory environments through tightly integrated cognitive capabilities such as audio perception, audio reasoning, and memory. Despite recent progress in large audio-language models (LALMs) across speech understanding and multimodal audio reasoning, current evaluation paradigms remain largely task- or modality-centric, focusing on end performance while overlooking underlying auditory cognitive behaviours. This reveals a fundamental gap between how auditory cognition is understood in humans and how it is evaluated in LALMs, particularly in the lack of frameworks that operationalise cognitive principles beyond task-level metrics to systematically capture model behaviour. In this work, we introduce RAIL, a human-centric evaluation paradigm grounded in the Cattell-Horn-Carroll (CHC) cognitive framework. RAIL formalises auditory cognition into five core capabilities and develops them into structured evaluation tasks that probe how models process, retain, and integrate auditory information. We further construct a cognitively grounded benchmark with principled data curation and human-aligned evaluation protocols. Evaluating 26 state-of-the-art LALMs, we find that current models exhibit highly uneven performance across cognitive abilities. RAIL establishes a new evaluation paradigm that moves beyond task-centric benchmarking toward cognitively grounded assessment of auditory intelligence.