Neural speech codecs have demonstrated their ability to compress high-quality speech and audio by converting them into discrete token representations. Most existing methods utilize Residual Vector Quantization (RVQ) to encode speech into multiple layers of discrete codes at a uniform time scale. However, this strategy overlooks the differences in information density across speech features, leading to redundant encoding of sparse information and limiting performance at low bitrates. This paper proposes MsCodec, a novel multi-scale neural speech codec that encodes speech into multiple layers of discrete codes, each corresponding to a different time scale. This encourages the model to decouple speech features according to their diverse information densities, consequently enhancing the performance of speech compression. Furthermore, we incorporate a mutual information loss to increase the diversity of speech codes across layers. Experimental results indicate that our proposed method significantly improves codec performance at low bitrates.
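To make the RVQ baseline concrete: each quantization layer encodes the residual left over by the previous layer, so the layers jointly refine one reconstruction. The sketch below is purely illustrative, using small random codebooks rather than the learned codebooks of a trained codec, and the function name `rvq_encode` is ours, not from the paper.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual Vector Quantization (illustrative sketch).

    Each stage picks the nearest codeword to the current residual,
    adds it to the running reconstruction, and passes the remaining
    residual to the next stage. Codebooks here are random, not
    learned as in a real neural codec.
    """
    residual = x.copy()
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:
        # index of the codeword closest to the current residual
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]
    return codes, quantized

rng = np.random.default_rng(0)
dim, codebook_size, n_layers = 8, 16, 4
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_layers)]
x = rng.normal(size=dim)

codes, x_hat = rvq_encode(x, codebooks)
err = np.linalg.norm(x - x_hat)
print(len(codes), round(float(err), 3))
```

In a standard RVQ codec all layers operate at the same frame rate; MsCodec's departure, as described above, is to let each layer use a different time scale so that slowly varying features are not re-encoded at every frame.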