Recently, end-to-end neural audio/speech coding has shown great potential to outperform traditional signal-analysis-based audio codecs. This is mostly achieved by following the VQ-VAE paradigm, in which blind features are learned, vector-quantized, and coded. In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding. Specifically, more global speaker identity features and local content features are learned with disentanglement to represent speech. Such a compact feature decomposition not only achieves better coding efficiency by exploiting bit allocation among the different features, but also provides the flexibility to do audio editing in the embedding space, such as voice conversion in real-time communications. Both subjective and objective results demonstrate its coding efficiency. We also find that, on any-to-any voice conversion, the learned disentangled features perform comparably to modern self-supervised speech representation learning models while using far fewer parameters and incurring low latency, showing the potential of our neural coding framework.