Neural speech codecs aim to compress input signals into a minimal number of bits while preserving content quality at low latency. However, existing codecs often trade computational complexity for reconstruction performance. These codecs primarily use convolutional blocks for their feature transformation layers, which are not inherently suited for capturing the local redundancies in speech signals. To compensate, they require either adversarial discriminators or a large number of model parameters to achieve high audio quality. In response to these challenges, we introduce the Efficient Speech Codec (ESC), a lightweight, parameter-efficient speech codec based on a cross-scale residual vector quantization scheme and transformers. Our model employs mirrored hierarchical window transformer blocks and performs step-wise decoding from coarse-to-fine feature representations. To enhance bitrate efficiency, we propose a novel combination of vector quantization techniques along with a pre-training paradigm. Extensive experiments demonstrate that ESC achieves high-fidelity speech reconstruction at significantly lower complexity, making it a promising alternative to existing convolutional audio codecs.
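To give an intuition for the residual vector quantization underlying schemes like the one above, here is a minimal generic sketch in NumPy. Note the codebooks, dimensions, and stage count below are toy assumptions for illustration; this shows plain residual VQ, not ESC's specific cross-scale variant: each stage picks the nearest codeword to the residual left by the previous stage, so the quantized sum refines the approximation stage by stage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical values, not from the paper):
# 3 quantizer stages, each with its own small random codebook.
dim, codebook_size, n_stages = 8, 16, 3
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_stages)]

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes the residual of the previous stage.

    Returns the list of codeword indices (the transmitted bits) and the
    reconstructed vector (sum of the selected codewords).
    """
    residual = x.copy()
    codes = []
    quantized = np.zeros_like(x)
    for cb in codebooks:
        # Nearest codeword to the current residual under L2 distance.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]
    return codes, quantized

x = rng.normal(size=dim)
codes, x_hat = rvq_encode(x, codebooks)
print(codes, np.linalg.norm(x - x_hat))
```

Each stage costs only log2(codebook_size) bits, so stacking stages trades bitrate for reconstruction accuracy; a trained codec learns the codebooks rather than sampling them randomly as done here.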