This paper introduces APCodec, a novel neural audio codec targeting high waveform sampling rates and low bitrates, which integrates the strengths of parametric codecs and waveform codecs. Like parametric codecs, APCodec treats the amplitude and phase spectra as audio parametric characteristics and handles them concurrently during encoding and decoding. It is composed of an encoder and a decoder, both with a modified ConvNeXt v2 network as the backbone, connected by a quantizer based on the residual vector quantization (RVQ) mechanism. The encoder compresses the audio amplitude and phase spectra in parallel and merges them into a continuous latent code at a reduced temporal resolution. This code is then discretized by the quantizer. Finally, the decoder reconstructs the audio amplitude and phase spectra in parallel, and the decoded waveform is obtained by the inverse short-time Fourier transform. To ensure the fidelity of the decoded audio, as waveform codecs do, spectral-level loss, quantization loss, and generative adversarial network (GAN) based loss are jointly employed to train APCodec. To support low-latency streamable inference, we employ feed-forward layers and causal convolutional layers in APCodec and incorporate a knowledge distillation training strategy to enhance the quality of the decoded audio. Experimental results confirm that the proposed APCodec can encode 48 kHz audio at a bitrate of just 6 kbps with no significant degradation in the quality of the decoded audio. At the same bitrate, APCodec also achieves better decoded audio quality and faster generation speed than well-known codecs such as SoundStream, Encodec, HiFi-Codec and AudioDec.
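As an illustration of the amplitude and phase representation and the inverse short-time Fourier transform mentioned above, the following is a minimal NumPy sketch, not the APCodec implementation. The function names `stft_amp_phase` and `istft_from_amp_phase`, the Hann window, and the values `n_fft=320` and `hop=80` are assumptions chosen for the example; the abstract does not specify them.

```python
import numpy as np

def stft_amp_phase(x, n_fft=320, hop=80):
    """Split a waveform into log-amplitude and phase spectra.

    Returns two arrays of shape (frames, n_fft // 2 + 1).
    Window type, n_fft and hop are illustrative assumptions.
    """
    window = np.hanning(n_fft + 1)[:-1]          # periodic Hann window
    pad = n_fft // 2
    x = np.pad(x, (pad, pad), mode="reflect")    # center the first frame on sample 0
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    log_amp = np.log(np.abs(spec) + 1e-9)        # small floor avoids log(0)
    phase = np.angle(spec)                       # wrapped to (-pi, pi]
    return log_amp, phase

def istft_from_amp_phase(log_amp, phase, length, n_fft=320, hop=80):
    """Rebuild a waveform from log-amplitude and phase by weighted overlap-add."""
    window = np.hanning(n_fft + 1)[:-1]
    spec = np.exp(log_amp) * np.exp(1j * phase)
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    n_frames = frames.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop:i * hop + n_fft] += frames[i]
        norm[i * hop:i * hop + n_fft] += window ** 2
    out = out / np.maximum(norm, 1e-8)           # undo the analysis/synthesis windows
    pad = n_fft // 2
    return out[pad:pad + length]                 # drop the reflect padding

# Example: a 0.1 s, 440 Hz tone sampled at 48 kHz
sr = 48000
t = np.arange(int(0.1 * sr)) / sr
wave = 0.5 * np.sin(2 * np.pi * 440.0 * t)
log_amp, phase = stft_amp_phase(wave)
rebuilt = istft_from_amp_phase(log_amp, phase, length=len(wave))
```

In a codec of this kind, the encoder would consume `log_amp` and `phase` and the decoder would predict them; here they are passed through unchanged, so `rebuilt` matches `wave` up to numerical precision.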
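The residual vector quantization (RVQ) step named in the abstract, and the way a bitrate follows from the quantizer configuration, can be sketched as below. This is a generic RVQ illustration with random codebooks, not the trained APCodec quantizer. The helper names are hypothetical, and the configuration of 150 latent frames per second with 4 codebooks of 1024 entries is an assumption picked because it yields 6 kbps; the abstract itself does not state these values.

```python
import math
import numpy as np

def rvq_encode(z, codebooks):
    """Quantize latent frames z of shape (frames, dim) stage by stage.

    Each codebook has shape (size, dim). Every stage quantizes the residual
    left over by the previous stages. Returns the chosen indices with shape
    (frames, n_codebooks) and the summed quantized vectors.
    """
    residual = z.astype(float).copy()
    quantized = np.zeros_like(residual)
    indices = []
    for cb in codebooks:
        # Squared Euclidean distance from every residual to every codeword
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)
        chosen = cb[idx]
        quantized += chosen
        residual -= chosen
        indices.append(idx)
    return np.stack(indices, axis=1), quantized

def rvq_decode(indices, codebooks):
    """Sum the selected codewords of all stages to recover the quantized latent."""
    return sum(cb[indices[:, k]] for k, cb in enumerate(codebooks))

def bitrate_bps(frame_rate_hz, n_codebooks, codebook_size):
    """Bits per second: each frame sends one index of log2(size) bits per codebook."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)

# Toy example with random (untrained) codebooks
rng = np.random.default_rng(0)
toy_codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]
latent = rng.normal(size=(5, 8))
codes, latent_q = rvq_encode(latent, toy_codebooks)

# Assumed configuration: 150 frames/s, 4 codebooks, 1024 entries each
assumed_rate = bitrate_bps(150, 4, 1024)
```

Only the integer `codes` need to be transmitted; the decoder holds the same codebooks and recovers `latent_q` with `rvq_decode`. Under the assumed configuration, `assumed_rate` evaluates to 150 × 4 × 10 = 6000 bits per second.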