Neural audio/speech coding has recently demonstrated its capability to deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs encode either acoustic features or blind features learned with a convolutional neural network, which leaves temporal redundancies in the encoded features. This paper introduces latent-domain predictive coding into the VQ-VAE framework to fully remove such redundancies and proposes TF-Codec, an end-to-end codec for low-latency neural speech coding. Specifically, the extracted features are encoded conditioned on a prediction from past quantized latent frames, so that temporal correlations are further removed. Moreover, we introduce a learnable compression on the time-frequency input to adaptively adjust the attention paid to main frequencies and details at different bitrates. A differentiable vector quantization scheme based on distance-to-soft mapping and Gumbel-Softmax is proposed to better model the latent distributions under a rate constraint. Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps, and TF-Codec at 3 kbps outperforms both EVS at 9.6 kbps and Opus at 12 kbps. Extensive studies demonstrate the effectiveness of these techniques.
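To make the quantization idea concrete, the following is a minimal sketch of one way to combine a distance-to-soft mapping with Gumbel-Softmax sampling. It is not the TF-Codec implementation: the function names, the softmax over negative squared distances, the temperatures, and the use of the soft assignment's entropy as a rate proxy are all assumptions made for illustration.

```python
import math
import random


def soft_assignment(latent, codebook, temperature=1.0):
    """Assumed distance-to-soft mapping: softmax over negative squared distances."""
    logits = [
        -sum((x - c) ** 2 for x, c in zip(latent, code)) / temperature
        for code in codebook
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(probs, tau=1.0, rng=random):
    """Relaxed one-hot sample: add Gumbel noise to log-probabilities, then softmax."""
    eps = 1e-12
    noisy = []
    for p in probs:
        u = min(max(rng.random(), eps), 1.0 - eps)  # keep both logs finite
        gumbel = -math.log(-math.log(u))
        noisy.append((math.log(p + eps) + gumbel) / tau)
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(latent, codebook, tau=1.0, rng=random):
    """Return (soft codeword, hard index, entropy in bits) for one latent frame."""
    probs = soft_assignment(latent, codebook)
    weights = gumbel_softmax(probs, tau, rng)
    # Soft output: convex combination of codewords (differentiable in a real framework).
    soft = [
        sum(w * code[d] for w, code in zip(weights, codebook))
        for d in range(len(latent))
    ]
    # Hard index: what an encoder would transmit at inference time.
    hard_index = max(range(len(codebook)), key=lambda k: weights[k])
    # Assumed rate proxy: entropy of the soft assignment, in bits.
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return soft, hard_index, entropy
```

In this sketch, lowering `tau` pushes the sampled weights toward a one-hot vector, so the soft output approaches a single codeword, while the entropy term could serve as a penalty in a rate-constrained training loss. An automatic-differentiation framework would be needed to actually train with it.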