In low-bitrate speech coding, end-to-end speech coding networks aim to learn compact yet expressive features and a powerful decoder within a single network. Tackling such a challenging problem jointly leads to an unwelcome increase in complexity and inferior speech quality. In this paper, we propose to separate the representation learning and information reconstruction tasks. We leverage an end-to-end codec to learn low-dimensional discrete tokens and employ a latent diffusion model to de-quantize the coded features into a high-dimensional continuous space, relieving the decoder of the burden of de-quantizing and upsampling. To mitigate over-smooth generation, we introduce midway-infilling with less noise reduction and stronger conditioning. In ablation studies, we investigate the hyperparameters for midway-infilling and latent diffusion spaces of different dimensions. Subjective listening tests show that our model outperforms the state of the art at two low bitrates, 1.5 and 3 kbps. Code and samples for this work are available on our webpage.