Large audio language models (ALMs) extend LLMs with auditory understanding. A common approach freezes the LLM and trains only an adapter on self-generated targets. However, this fails for reasoning LLMs (RLMs), whose built-in chain-of-thought traces expose the textual surrogate input and yield unnatural responses. We propose self-rephrasing, which converts self-generated responses into audio-understanding variants compatible with RLMs while preserving distributional alignment. We further fuse and compress multiple audio encoders for stronger representations. For training, we construct a 6M-instance multi-task corpus (2.5M unique prompts) spanning 19K hours of speech, music, and sound. Our 4B-parameter ALM outperforms similarly sized models and surpasses most larger ALMs on audio-reasoning benchmarks, while preserving textual capabilities at low training cost. Notably, we achieve the best open-source results on the MMAU-speech and MMSU benchmarks and rank third among all models.
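To make the self-rephrasing idea concrete, the sketch below shows one plausible shape of the data-construction step, assuming the frozen RLM is exposed as a simple text-generation callable. The function name, prompt templates, and the `rlm_generate` interface are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of a self-rephrasing data pipeline (names and prompts are
# assumptions for illustration, not the paper's exact procedure).

def build_training_pair(rlm_generate, transcript: str, task_prompt: str):
    """Create an (audio-prompt, target) training pair from a text surrogate.

    rlm_generate: callable running the frozen reasoning LLM on a text prompt.
    transcript:   textual surrogate of the audio (e.g., a transcript or caption).
    task_prompt:  the instruction that will later be paired with the raw audio.
    """
    # 1) Self-generation: the frozen RLM answers using the textual surrogate.
    #    Its chain-of-thought naturally refers to "the text", which would read
    #    unnaturally once the model receives audio instead.
    raw_response = rlm_generate(
        f"Text input: {transcript}\nInstruction: {task_prompt}"
    )

    # 2) Self-rephrasing: the same RLM rewrites its own response so the
    #    reasoning trace refers to listening rather than reading, while keeping
    #    the content (and thus the output distribution) as close as possible.
    rephrased = rlm_generate(
        "Rewrite the following response as if the input had been an audio clip "
        "you listened to rather than a text you read. Keep the reasoning and "
        f"the final answer unchanged:\n{raw_response}"
    )

    # The adapter is then trained to map (audio, task_prompt) -> rephrased,
    # with the LLM weights kept frozen.
    return task_prompt, rephrased
```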