The rapid rise of real-time communication and large language models has significantly increased the importance of speech compression. Deep-learning-based neural speech codecs have surpassed traditional signal-processing codecs in rate-distortion (RD) performance. Typically, these neural codecs adopt an encoder-quantizer-decoder architecture, in which audio is first mapped to latent feature representations and then discretized into tokens. However, this architecture suffers from two main drawbacks that limit RD performance: (1) the quantizer performs inadequately, is difficult to train, and is prone to issues such as codebook collapse; (2) the encoder and decoder have limited representational capacity, making it difficult to meet feature-representation requirements across various bitrates. In this paper, we propose a rate-aware learned speech compression scheme that replaces the quantizer with an advanced channel-wise entropy model, improving RD performance, simplifying training, and avoiding codebook collapse. We further employ mixture blocks of multi-scale convolution and linear attention to enhance the representational capacity and flexibility of the encoder and decoder. Experimental results demonstrate that the proposed method achieves state-of-the-art RD performance, with an average BD-Rate saving of 53.51% and gains of 0.26 BD-ViSQOL and 0.44 BD-PESQ.
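To make the quantizer-replacement idea concrete: a channel-wise entropy model splits the latent channels into groups and codes each group with a Gaussian model whose parameters are conditioned on the groups already coded, so the rate can be estimated (and minimized) differentiably instead of relying on a learned codebook. The following is a minimal NumPy sketch of that rate computation only; the running-mean context predictor, the group count, and the fixed unit scale are hypothetical stand-ins for the learned parameter network described in the paper, not the actual model.

```python
import numpy as np
from math import erf, sqrt

def gaussian_bits(y_hat, mean, scale):
    """Bits to code integer-rounded symbols under a Gaussian entropy model:
    P(y) = CDF(y + 0.5) - CDF(y - 0.5), rate = -sum(log2 P)."""
    def cdf(x, mu, s):
        return 0.5 * (1.0 + erf((x - mu) / (s * sqrt(2.0))))
    probs = np.array([
        max(cdf(y + 0.5, m, s) - cdf(y - 0.5, m, s), 1e-9)
        for y, m, s in zip(y_hat.ravel(), mean.ravel(), scale.ravel())
    ])
    return float(-np.log2(probs).sum())

def channelwise_rate(latent, n_groups=4):
    """Estimate the bitrate of a (channels, time) latent, coding channel
    groups sequentially so each group is conditioned on previous ones.
    A toy running-mean predictor replaces the learned parameter network."""
    groups = np.array_split(np.arange(latent.shape[0]), n_groups)
    total_bits, coded = 0.0, []
    for idx in groups:
        y_hat = np.round(latent[idx])            # quantization by rounding
        if coded:
            ctx = np.concatenate(coded)
            mean = np.full_like(y_hat, ctx.mean())   # toy context prediction
        else:
            mean = np.zeros_like(y_hat)          # no context for first group
        scale = np.ones_like(y_hat)              # fixed scale for the sketch
        total_bits += gaussian_bits(y_hat, mean, scale)
        coded.append(y_hat)
    return total_bits
```

In a real codec the mean and scale come from small networks fed with the previously decoded groups (plus hyperprior side information), and the same conditional distributions drive an arithmetic coder at inference time; the sketch only shows why no codebook is needed.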