Large language models (LLMs) are widely used as the knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training, and how this affects downstream performance, remains unclear. We study this gap by comparing different LLMs under two text-only settings and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions produced by an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into an LALM with an audio encoder. Our findings reveal that auditory knowledge varies substantially across model families, and that text-only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.