Speech codecs that convert continuous speech signals into discrete tokens have become essential for speech language models (SLMs). However, existing codecs struggle to balance high-quality reconstruction with semantically rich representations, limiting their effectiveness in both generative and understanding tasks. In this work, we propose SAC, a neural speech codec with semantic-acoustic dual-stream quantization. By disentangling semantic and acoustic modeling into two dedicated streams, SAC enables each to be optimized for its respective role. Comprehensive evaluations show that SAC achieves strong reconstruction performance across diverse bitrates under both clean and noisy conditions, with particularly strong UTMOS and WER results, demonstrating superior perceptual quality and intelligibility. Moreover, SAC substantially outperforms state-of-the-art codecs in semantic representation, achieving a level comparable to that of self-supervised learning (SSL) continuous embeddings. Finally, our analysis of speech disentanglement highlights the effectiveness of the dual-stream design, offering new potential for controllable speech applications.