In this paper, we propose MDCTCodec, an efficient, lightweight end-to-end neural audio codec based on the modified discrete cosine transform (MDCT). The encoder takes the MDCT spectrum of the audio as input and encodes it into a continuous latent code, which is then discretized by a residual vector quantizer (RVQ). The decoder then decodes the MDCT spectrum from the quantized latent code and reconstructs the audio via the inverse MDCT. During training, a novel multi-resolution MDCT-based discriminator (MR-MDCTD) is adopted to distinguish natural from decoded MDCT spectra for adversarial training. Experimental results confirm that, at high sampling rates and low bitrates, MDCTCodec achieves high decoded audio quality, improved training and generation efficiency, and a compact model size compared with baseline codecs. Specifically, MDCTCodec achieves a ViSQOL score of 4.18 at a sampling rate of 48 kHz and a bitrate of 6 kbps on the public VCTK corpus.
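To make the analysis and synthesis ends of this pipeline concrete, the following is a minimal NumPy sketch of an MDCT and inverse MDCT round trip. It is not the authors' implementation: the sine window, the frame size `N`, the zero padding, and the function names (`mdct`, `imdct`, `mdct_basis`, `sine_window`) are assumptions chosen for illustration. It shows only that a real-valued, critically sampled MDCT spectrum can be inverted back to the waveform by windowed overlap-add, which is the property a codec operating on MDCT spectra relies on.

```python
import numpy as np


def mdct_basis(N):
    """Cosine basis of shape (N, 2N) for frames of 2N samples (assumed layout)."""
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))


def sine_window(N):
    """Sine window of length 2N; satisfies w[n]^2 + w[n+N]^2 = 1."""
    return np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))


def mdct(x, N):
    """Return MDCT coefficients of shape (num_frames, N) for a 1-D signal x."""
    # Pad N zeros on each side so every signal sample is covered by two frames,
    # then pad the tail up to a multiple of the hop size N.
    x = np.concatenate([np.zeros(N), x, np.zeros(N)])
    x = np.concatenate([x, np.zeros((-len(x)) % N)])
    num_frames = len(x) // N - 1
    frames = np.stack([x[i * N: i * N + 2 * N] for i in range(num_frames)])
    return (frames * sine_window(N)) @ mdct_basis(N).T


def imdct(X, N, length):
    """Invert mdct() by windowed overlap-add; returns `length` samples."""
    # The 2/N scale pairs with applying the window at analysis and synthesis.
    frames = (2.0 / N) * (X @ mdct_basis(N)) * sine_window(N)
    out = np.zeros((X.shape[0] + 1) * N)
    for i, frame in enumerate(frames):
        out[i * N: i * N + 2 * N] += frame  # time-domain aliasing cancels here
    return out[N: N + length]  # drop the leading padding


# Round trip on a random test signal (hypothetical sizes).
rng = np.random.default_rng(0)
signal = rng.standard_normal(1000)
spectrum = mdct(signal, N=64)
restored = imdct(spectrum, N=64, length=len(signal))
print("max reconstruction error:", np.max(np.abs(signal - restored)))
```

In a codec of the kind described above, the encoder, quantizer, and decoder would sit between `mdct` and `imdct`, so the reconstruction is no longer exact and its quality depends on the learned model.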
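The discretization step can be sketched in the same spirit. The code below is a generic residual vector quantizer in NumPy, assuming random, untrained codebooks and hypothetical names (`rvq_encode`, `rvq_decode`, `bitrate_bps`); the abstract does not state the codebook sizes, number of stages, or frame rate used in MDCTCodec, so every number here is illustrative only. Each stage quantizes what the previous stages left over, and the bitrate follows from the frame rate and the codebook sizes.

```python
import numpy as np


def rvq_encode(z, codebooks):
    """Quantize latents z of shape (T, D) with a list of (K, D) codebooks.

    Returns integer indices of shape (T, Q), one column per stage.
    """
    residual = z.copy()
    indices = []
    for cb in codebooks:
        # Squared Euclidean distance from every residual to every codeword.
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)
        indices.append(idx)
        residual = residual - cb[idx]  # the next stage quantizes what is left
    return np.stack(indices, axis=1)


def rvq_decode(indices, codebooks):
    """Sum the selected codewords of all stages to get the quantized latent."""
    return sum(cb[indices[:, q]] for q, cb in enumerate(codebooks))


def bitrate_bps(frames_per_second, codebooks):
    """Bits per second: each stage costs log2(K) bits per latent frame."""
    return frames_per_second * sum(np.log2(len(cb)) for cb in codebooks)


# Hypothetical setup: 3 stages, 16 codewords each, 8-dimensional latents.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((16, 8)) * 0.5 ** q for q in range(3)]
latents = rng.standard_normal((5, 8))
codes = rvq_encode(latents, codebooks)
quantized = rvq_decode(codes, codebooks)
print(codes.shape, quantized.shape)
```

As a purely arithmetic illustration, four stages of 1024 codewords at 150 latent frames per second would cost 150 × 4 × 10 = 6000 bits per second; these values are chosen only to show the formula and are not taken from the paper.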