Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. However, this adaptation process introduces two major limitations. First, ALLMs often suffer from catastrophic forgetting, where crucial textual capabilities such as instruction-following are lost after training on audio data; in some cases, models may even hallucinate sounds that are not present in the input audio, raising concerns about reliability. Second, achieving cross-modal alignment between audio and language typically relies on large collections of task-specific question-answer pairs for instruction tuning, making the process resource-intensive. To address these issues, previous works have leveraged backbone LLMs to synthesize general-purpose, caption-style alignment data. In this paper, we propose a data generation framework that produces contrastive-like training data, designed to strengthen ALLMs' ability to distinguish between present and absent sounds. We further extend the approach to multi-audio scenarios, enabling the model to either explain the differences between audio inputs or produce a unified caption that describes them all, thereby improving audio-language alignment. We refer to the overall ALLM training framework as bootstrapping audio-language alignment via synthetic data generation from backbone LLMs (BALSa). Experimental results show that our method effectively mitigates audio hallucinations while reliably maintaining strong performance on audio understanding and reasoning benchmarks, as well as instruction-following skills. Moreover, incorporating multi-audio training further improves the model's comprehension and reasoning capabilities. Overall, BALSa offers an efficient and scalable approach to developing ALLMs.
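To make the flavor of this contrastive-like data concrete, the sketch below shows one way present/absent question-answer pairs could be assembled from caption-annotated audio. It is a minimal illustration under stated assumptions, not the released implementation: the `SOUND_EVENTS` vocabulary, the `make_contrastive_pairs` helper, and the templated questions are hypothetical, and in the actual framework the backbone LLM synthesizes the alignment text rather than a fixed template.

```python
# Illustrative sketch only: assembling present/absent QA pairs from
# caption-annotated audio. Names and templates here are hypothetical;
# BALSa itself uses the backbone LLM to synthesize the text.
import random

# Hypothetical sound-event vocabulary used to sample "absent" distractors.
SOUND_EVENTS = ["dog barking", "rain falling", "car horn", "piano music",
                "speech", "door slamming", "bird chirping"]

def make_contrastive_pairs(caption: str, present_events: list[str],
                           num_negatives: int = 2) -> list[dict]:
    """Build present/absent question-answer pairs for one audio clip.

    `caption` and `present_events` come from existing audio metadata, so
    both the "Yes" and "No" targets are grounded in the annotation rather
    than guessed from the audio.
    """
    pairs = []
    for event in present_events:
        pairs.append({
            "audio_caption": caption,
            "question": f"Is there the sound of {event} in the audio?",
            "answer": "Yes",   # positive: event is annotated as present
        })
    absent = [e for e in SOUND_EVENTS if e not in present_events]
    for event in random.sample(absent, k=min(num_negatives, len(absent))):
        pairs.append({
            "audio_caption": caption,
            "question": f"Is there the sound of {event} in the audio?",
            "answer": "No",    # negative: event is not in the annotation
        })
    return pairs

# Example usage on a single annotated clip.
for p in make_contrastive_pairs(
        caption="A dog barks while rain falls steadily in the background.",
        present_events=["dog barking", "rain falling"]):
    print(p["question"], "->", p["answer"])
```

Under the same assumptions, the multi-audio extension described above would pair two annotated clips and ask the backbone LLM either for an explanation of their differences or for a single caption covering both.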