Existing neural audio codecs usually trade computational efficiency for audio quality. Their feature transformation layers are built mainly on convolutional blocks, which are not inherently well suited to capturing the local redundancies of audio signals. To compensate, they require either adversarial losses from a discriminator or a large number of model parameters. To address this, we propose the Efficient Speech Codec (ESC), a lightweight, parameter-efficient codec built on cross-scale residual vector quantization and transformers. Our model uses mirrored hierarchical window-attention transformer blocks and performs step-wise decoding from coarse to fine feature representations. To improve codebook utilization, we design a learning paradigm that includes a pre-training stage to assist codec training. Extensive experiments show that ESC achieves high audio quality at much lower complexity, making it a promising alternative to existing codecs.
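To make the quantization idea concrete, here is a minimal sketch of plain residual vector quantization, in which each stage quantizes the error left by the previous one, so decoding with more stages moves from a coarse to a finer reconstruction. It is an illustration only, not the cross-scale scheme used in ESC: the function names (`nearest`, `rvq_encode`, `rvq_decode`) and the toy codebooks are assumptions introduced here, and a real codec would learn its codebooks and operate on transformer features rather than hand-written 2-D vectors.

```python
from typing import List, Sequence

Vector = List[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    def sq_dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the previous stage's residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword so the next stage only sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> Vector:
    """Sum the selected codewords; passing fewer codes gives a coarser reconstruction."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy, hand-written codebooks (hypothetical values): a coarse stage and a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
codes = rvq_encode([1.2, 0.8], codebooks)
coarse = rvq_decode(codes[:1], codebooks)  # first stage only
fine = rvq_decode(codes, codebooks)        # both stages
```

In this toy case the second stage reduces the reconstruction error left by the first, which is the coarse-to-fine behaviour the abstract refers to. Each stage adds one index to the bitstream, so the number of stages controls the bitrate.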