DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

Source: arXiv. Published in IEEE Transactions on Audio, Speech and Language Processing (TASLP). Model and code available at: https://github.com/kehanlu/DeSTA2.5-Audio

We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following. Recent LALMs augment Large Language Models (LLMs) with auditory capabilities by training on large-scale audio-instruction datasets. However, existing LALMs often suffer from catastrophic forgetting of the LLM's original abilities, so balancing knowledge retention and audio perception has become a critical challenge. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach aims to preserve the LLM's native language proficiency, thereby enabling zero-shot generalization without task-specific tuning. We construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms existing training strategies. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.
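
The self-generated strategy described in the abstract can be illustrated with a minimal sketch. The abstract states only that the backbone LLM generates its own training targets; everything else here is an assumption made for illustration: that each audio clip comes with textual annotations, that those annotations are rendered as a text description the text-only LLM can read, and all names (`AudioRecord`, `TrainingSample`, `describe`, `build_samples`, `toy_llm`). It is not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AudioRecord:
    audio_path: str            # hypothetical path to an audio clip
    metadata: Dict[str, str]   # hypothetical annotations from the source dataset


@dataclass
class TrainingSample:
    audio_path: str   # what the audio-language model hears during training
    prompt: str       # instruction paired with the clip
    target: str       # response written by the backbone LLM itself


def describe(metadata: Dict[str, str]) -> str:
    """Render annotations as plain text (an assumed step, not from the abstract)."""
    return "; ".join(f"{key}: {value}" for key, value in sorted(metadata.items()))


def build_samples(
    records: List[AudioRecord],
    prompts: List[str],
    backbone_llm: Callable[[str], str],
) -> List[TrainingSample]:
    """Pair each clip with a prompt and let the backbone LLM write the target."""
    samples = []
    for record, prompt in zip(records, prompts):
        # The LLM never sees audio here, only a textual stand-in for it.
        text_input = (
            f"[Audio description] {describe(record.metadata)}\n"
            f"[Instruction] {prompt}"
        )
        target = backbone_llm(text_input)
        samples.append(TrainingSample(record.audio_path, prompt, target))
    return samples


def toy_llm(text: str) -> str:
    """Stand-in for the backbone LLM; a real setup would call the actual model."""
    return f"(response to {len(text)} characters of context)"


records = [AudioRecord("clip.wav", {"transcript": "hello", "emotion": "happy"})]
samples = build_samples(records, ["What is the speaker's mood?"], toy_llm)
```

Because the targets come from the same LLM that is later trained on them, they stay within the model's own output distribution, which is the intuition behind the claim that native language proficiency is preserved.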
