Semantic speech tokenizers have become a widely used interface for Audio-LLMs, owing to their compact single-codebook design and strong linguistic alignment. However, their focus on linguistic abstraction induces acoustic blindness, limiting their applicability beyond speech-centric tasks. We propose UniAudio-Token, a framework that empowers semantic tokenizers with general audio perception without compromising speech ability. Instead of altering the semantic paradigm, UniAudio-Token mitigates its information loss through two key innovations: (1) Semantic-Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory-scene primitives; and (2) Semantic-Acoustic Equilibrium (SAE) introduces a content-aware gating mechanism that adaptively restores fine-grained acoustic details from shallow layers. Extensive evaluations show that UniAudio-Token learns comprehensive universal representations while preserving high-fidelity speech generation. When integrated with downstream LLMs, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks, effectively serving as a unified audio interface. We publicly release all our code, including training and inference scripts, together with the model checkpoints at https://github.com/Tencent/Universal_Audio_Tokenizer.
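To make the idea of content-aware gating more concrete, the following is a minimal sketch of one way a gate could blend shallow-layer acoustic features back into deep semantic features. The abstract does not specify the actual SAE formulation, so everything here is an assumption made for illustration: the function name `gated_fusion`, the per-frame scalar gate, the linear gate parameters `gate_weights` and `gate_bias`, and the additive residual form are all hypothetical and are not taken from the released code.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    deep: Sequence[Vector],
    shallow: Sequence[Vector],
    gate_weights: Vector,
    gate_bias: float = 0.0,
) -> List[Vector]:
    """Fuse per-frame deep (semantic) and shallow (acoustic) features.

    Hypothetical formulation (an assumption, not the paper's method).
    For each frame t:
        g_t   = sigmoid(gate_weights . deep_t + gate_bias)
        out_t = deep_t + g_t * shallow_t
    The gate depends on the deep features, so how much acoustic detail
    is restored varies with the content of each frame.
    """
    if len(deep) != len(shallow):
        raise ValueError("deep and shallow must have the same number of frames")
    fused: List[Vector] = []
    for d, s in zip(deep, shallow):
        if not (len(d) == len(s) == len(gate_weights)):
            raise ValueError("feature dimensions must match")
        # Scalar gate computed from the semantic (deep) features of this frame.
        gate = sigmoid(sum(w * x for w, x in zip(gate_weights, d)) + gate_bias)
        # Residual addition of the gated shallow-layer features.
        fused.append([x + gate * y for x, y in zip(d, s)])
    return fused


if __name__ == "__main__":
    deep = [[1.0, 0.0], [0.0, 0.0]]
    shallow = [[0.5, 0.5], [2.0, -2.0]]
    # Zero weights and zero bias give a gate of exactly 0.5 on every frame.
    print(gated_fusion(deep, shallow, gate_weights=[0.0, 0.0]))
```

In this sketch, a strongly negative `gate_bias` drives the gate toward zero, so the output stays close to the purely semantic features, while a strongly positive bias passes the shallow features through almost unchanged. A learned gate would sit between these extremes on a per-frame basis.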