Neural audio codecs have surged in popularity as speech tokenizers with the advent of Multimodal Large Language Models, and new codec architectures that disentangle semantic and acoustic information have emerged. There are two main approaches to introducing semantic information into codec models: one distills semantic information from self-supervised learning (SSL) representations into the first residual vector quantization (RVQ) layer, while the other maintains separate streams for semantic and acoustic features. We propose HybridCodec, a unified architecture that combines both paradigms: it employs separate semantic and acoustic branches while distilling SSL representations into the semantic stream. This design ensures strong disentanglement without requiring an SSL model during inference. HybridCodec shows superior semantic specialization (RVQ-1) on the in-domain test set and competitive reconstruction (RVQ-all). We also demonstrate its robustness in out-of-domain and zero-shot cross-lingual settings, and it achieves a 3x speedup over existing dual-stream models.
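For readers unfamiliar with the two ingredients named above, the following is a minimal NumPy sketch of generic residual vector quantization and of a distillation objective on the first quantizer's output. It is not the HybridCodec implementation: the function names `rvq_encode` and `distill_loss`, the array shapes, and the choice of a cosine-distance loss are all assumptions made for illustration, and the abstract does not state which distillation loss the authors use.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Generic residual vector quantization (illustrative, not the paper's code).

    x:         array of shape (T, D), one feature vector per frame.
    codebooks: list of arrays, each of shape (K, D).
    Returns the per-layer code indices and the per-layer quantized vectors.
    """
    residual = x.copy()
    codes, quantized = [], []
    for codebook in codebooks:
        # Squared Euclidean distance from every frame to every codeword: (T, K).
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        q = codebook[idx]
        codes.append(idx)
        quantized.append(q)
        # The next layer only sees what this layer failed to represent.
        residual = residual - q
    return codes, quantized


def distill_loss(first_layer, ssl_target):
    """Assumed distillation objective: 1 minus the mean cosine similarity
    between the first quantizer's output and SSL features of the same shape."""
    num = (first_layer * ssl_target).sum(axis=-1)
    den = (np.linalg.norm(first_layer, axis=-1)
           * np.linalg.norm(ssl_target, axis=-1) + 1e-8)
    return float(1.0 - (num / den).mean())
```

In this sketch, summing only `quantized[0]` corresponds to an RVQ-1 reconstruction and summing every entry corresponds to RVQ-all. In a trained codec the SSL target would come from a teacher model that is needed only during training, which is consistent with the abstract's statement that no SSL model is required at inference.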