Autoregressive (AR) large audio language models (LALMs) such as Qwen-2.5-Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and strictly sequential decoding limits inference efficiency. Diffusion large language models (dLLMs) have recently been shown to make effective use of limited training data, and prior work on DIFFA indicates that replacing an AR backbone with a diffusion counterpart can substantially improve audio understanding under matched settings, albeit at a proof-of-concept scale without large-scale instruction tuning, preference alignment, or practical decoding schemes. We introduce DIFFA-2, a practical diffusion-based LALM for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum that combines semantic and acoustic alignment, large-scale supervised fine-tuning, and variance-reduced preference optimization, using only fully open-source corpora. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA and is competitive with strong AR LALMs under practical training budgets, supporting diffusion-based modeling as a viable backbone for large-scale audio understanding. Our code is available at https://github.com/NKU-HLT/DIFFA.git.