Even without directly hearing sounds, humans can effortlessly reason about auditory properties such as pitch, loudness, and sound-source associations by drawing on auditory commonsense. In contrast, language models often lack this capability, limiting their effectiveness in multimodal interactions. As an initial step toward addressing this gap, we present AuditoryBench++, a comprehensive benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark spans tasks ranging from basic auditory comparisons to contextually grounded reasoning, enabling fine-grained analysis of how models process and integrate auditory concepts. In addition, we introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference through span detection with special tokens and knowledge injection. Extensive experiments with recent LLMs and multimodal LLMs demonstrate that AIR-CoT generally outperforms both off-the-shelf models and those augmented with auditory knowledge. The project page is available at https://auditorybenchpp.github.io.
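The abstract only sketches how AIR-CoT interleaves span detection and knowledge injection with generation. Below is a minimal illustrative sketch of such an inference loop, not the paper's actual implementation: the `<imagine>`/`</imagine>` tokens, the `model.generate(context, stop=...)` interface, and the `retrieve_auditory_knowledge` helper are all hypothetical stand-ins.

```python
# Sketch of an AIR-CoT-style loop: the model marks spans that need auditory
# knowledge with special tokens; a knowledge module fills each span before
# generation resumes. All names below are illustrative assumptions.

IMAGINE_START, IMAGINE_END = "<imagine>", "</imagine>"

def retrieve_auditory_knowledge(span: str) -> str:
    """Placeholder for knowledge injection: map a detected span to a textual
    auditory fact (a real system might query an audio-language model)."""
    return f"[injected auditory fact for: {span.strip()}]"

def air_cot_generate(model, prompt: str, max_rounds: int = 4) -> str:
    """Alternate between free generation and knowledge injection whenever
    the model opens an <imagine> span (span detection via special tokens)."""
    context = prompt
    for _ in range(max_rounds):
        out = model.generate(context, stop=IMAGINE_END)  # hypothetical API
        context += out
        if IMAGINE_START not in out:          # no span detected: answer done
            return context
        span = out.rsplit(IMAGINE_START, 1)[-1]
        context += IMAGINE_END + " " + retrieve_auditory_knowledge(span) + " "
    return context

# Toy usage with a stub model that emits one imagination span, then answers.
class StubModel:
    def __init__(self):
        self._turns = iter([
            "Let me compare the sources. <imagine>pitch of a piccolo vs. a tuba",
            " So the piccolo is higher-pitched. Answer: piccolo.",
        ])
    def generate(self, context: str, stop: str) -> str:
        return next(self._turns, "")

print(air_cot_generate(StubModel(), "Q: Which sounds higher, a piccolo or a tuba?\n"))
```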