Fun-Audio-Chat Technical Report

Tongyi Fun Team, Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu, Chao-Hong Tan, Wen Wang, Junhao Xu, Jieping Ye, Qinglin Zhang, Qiquan Zhang, Jingren Zhou

From arXiv. Authors are listed in alphabetical order. 21 pages. Open-source at https://github.com/FunAudioLLM/Fun-Audio-Chat

Recent advancements in joint speech-text models show great potential for seamless voice interactions. However, existing models face critical challenges: the temporal resolution mismatch between speech tokens (25Hz) and text tokens (~3Hz) dilutes semantic information, incurs high computational costs, and causes catastrophic forgetting of text LLM knowledge. We introduce Fun-Audio-Chat, a Large Audio Language Model (LALM) that addresses these limitations via two innovations from our previous work DrVoice. First, Dual-Resolution Speech Representations (DRSR): the Shared LLM processes audio at an efficient 5Hz (via token grouping), while the Speech Refined Head generates high-quality tokens at 25Hz, balancing efficiency (~50% GPU reduction) and quality. Second, Core-Cocktail Training, a two-stage fine-tuning procedure with intermediate merging that mitigates catastrophic forgetting. We then apply Multi-Task DPO Training to enhance robustness, audio understanding, instruction-following, and voice empathy. This multi-stage post-training enables Fun-Audio-Chat to retain text LLM knowledge while gaining powerful audio understanding, reasoning, and generation. Unlike recent LALMs that require large-scale audio-text pre-training, Fun-Audio-Chat leverages pre-trained models and extensive post-training. Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on Speech-to-Text and Speech-to-Speech tasks, ranking top among similar-scale models on Spoken QA benchmarks. They also achieve competitive or superior performance on Audio Understanding, Speech Function Calling, Instruction-Following, and Voice Empathy. We also develop Fun-Audio-Chat-Duplex, a full-duplex variant with strong performance on Spoken QA and full-duplex interactions. We open-source Fun-Audio-Chat-8B with training and inference code, and provide an interactive demo, at https://github.com/FunAudioLLM/Fun-Audio-Chat .
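
The token grouping that takes the 25Hz speech-token stream down to 5Hz for the Shared LLM can be illustrated with a small sketch. The abstract states only the two rates, so everything else here is an assumption made for illustration: the group size of 5 (25Hz / 5Hz), concatenation as the way frames in a group are combined, zero-padding of the final group, and the names `group_tokens`, `FRAME_RATE_HZ`, and `GROUP_SIZE`. It is not the released implementation.

```python
from typing import List, Sequence

FRAME_RATE_HZ = 25  # speech-token rate stated in the abstract
GROUP_SIZE = 5      # assumed: 25 Hz / 5 Hz = 5 frames per LLM position


def group_tokens(
    frames: Sequence[Sequence[float]], group_size: int = GROUP_SIZE
) -> List[List[float]]:
    """Concatenate every `group_size` consecutive frame vectors into one vector.

    The last group is zero-padded if the frame count is not a multiple of
    `group_size` (a hypothetical choice; other padding schemes are possible).
    """
    if not frames:
        return []
    dim = len(frames[0])
    padded = [list(frame) for frame in frames]
    remainder = len(padded) % group_size
    if remainder:
        padded.extend([0.0] * dim for _ in range(group_size - remainder))

    grouped: List[List[float]] = []
    for start in range(0, len(padded), group_size):
        merged: List[float] = []
        for frame in padded[start:start + group_size]:
            merged.extend(frame)
        grouped.append(merged)
    return grouped


# Two seconds of audio at 25 Hz: 50 frames, each a toy 2-dimensional vector.
frames = [[float(i), float(i) + 0.5] for i in range(2 * FRAME_RATE_HZ)]
grouped = group_tokens(frames)
print(len(frames), "frames ->", len(grouped), "LLM positions")  # 50 frames -> 10 LLM positions
```

Under these assumptions the LLM sees one fifth as many audio positions, which brings the audio sequence length much closer to the ~3Hz text-token rate; the 25Hz tokens are then recovered on the output side by a separate head, as the abstract describes for the Speech Refined Head.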
