Neural speech codecs have attracted considerable attention for their outstanding reconstruction quality with discrete token representations, and they are a crucial component of generative tasks such as speech coding and large language models (LLMs). However, most methods based on residual vector quantization degrade with fewer tokens because they encode complex, coupled information inefficiently. In this paper, we propose a neural speech codec named FreeCodec that employs a more effective encoding framework by decomposing the intrinsic properties of speech into separate components: 1) a global vector captures timbre information, 2) a prosody encoder with a long stride models prosody information, and 3) a content encoder extracts content information. With different training strategies, FreeCodec achieves state-of-the-art performance in both reconstruction and disentanglement scenarios. Subjective and objective experiments demonstrate that our framework outperforms existing methods.
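The three-way decomposition described above can be sketched as follows. This is a minimal NumPy illustration of the idea only: the pooling operations, stride, and feature dimensions are assumptions for exposition, not FreeCodec's actual encoder architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frame-level speech features: T frames, D dimensions.
T, D = 100, 64
features = rng.standard_normal((T, D))

def timbre_encoder(x):
    # Global vector: mean-pool over all frames into one timbre embedding.
    return x.mean(axis=0)

def prosody_encoder(x, stride=8):
    # Long-stride pooling: one vector per `stride` frames,
    # giving a coarse, slowly varying prosody representation.
    n = (x.shape[0] // stride) * stride
    return x[:n].reshape(-1, stride, x.shape[1]).mean(axis=1)

def content_encoder(x):
    # Frame-level content representation (identity placeholder here;
    # the real codec would use a learned network).
    return x

timbre = timbre_encoder(features)    # shape (64,)   - one global vector
prosody = prosody_encoder(features)  # shape (12, 64) - long-stride sequence
content = content_encoder(features)  # shape (100, 64) - frame-level sequence
```

The point of the sketch is the differing temporal rates: timbre is a single utterance-level vector, prosody is heavily downsampled, and content stays at frame rate, so each stream can be quantized at a token budget matched to its information content.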