Audio tokenizers serve as the discrete interface between continuous audio and Audio Language Models (ALMs), but existing tokenizers often struggle to support both understanding and generation. Reconstruction-oriented codecs preserve acoustic fidelity but lack rich semantics, while semantic-aware tokenizers typically rely on separate semantic and acoustic streams, introducing redundancy or misalignment. We propose \textbf{EntangleCodec}, a unified discrete audio tokenizer that learns caption-aligned semantic-acoustic representations before quantization. By aligning audio with rich captions rather than ASR transcripts, EntangleCodec captures linguistic content, speaker identity, emotion, prosody, and acoustic scenes within a compact token stream. A flow-matching diffusion decoder further enables high-quality reconstruction across speech, music, and general audio. EntangleCodec achieves reconstruction quality competitive with specialized codecs, outperforms all codec-based baselines on audio understanding by up to \textbf{+7.4\%} on MMAR, and supports both TTS and TTA generation in a unified framework. Furthermore, EntangleCodec-based audio language models demonstrate strong scaling behavior: even at \textit{0.6B} parameters, the model surpasses specialized continuous-representation LLMs with over \textit{13B} parameters across three benchmarks using \textbf{22$\times$} fewer parameters; scaling to \textit{8B} further establishes new state-of-the-art results on MMAR, highlighting that representation quality is as critical as model scale in audio language modeling. Code and model weights are available at https://github.com/luckyerr/EntangleCodec.