We study two foundational problems in audio language models: (1) how to design an audio tokenizer that serves as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantically rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at \href{https://dongchaoyang.top/UniAudio2Demo/}{https://dongchaoyang.top/UniAudio2Demo/}.