In low-bitrate speech coding, end-to-end speech coding networks aim to learn compact yet expressive features and a powerful decoder in a single network. Such a challenging problem leads to an unwelcome increase in complexity and inferior speech quality. In this paper, we propose to separate the representation learning and information reconstruction tasks. We leverage an end-to-end codec to learn low-dimensional discrete tokens and employ a latent diffusion model to de-quantize the coded features into a high-dimensional continuous space, relieving the decoder of the burden of de-quantizing and upsampling. To mitigate over-smooth generation, we introduce midway-infilling with less noise reduction and stronger conditioning. In ablation studies, we investigate the hyperparameters for midway-infilling and latent diffusion spaces of different dimensions. Subjective listening tests show that our model outperforms the state of the art at two low bitrates, 1.5 and 3 kbps. Code and samples for this work are available on our webpage.