This paper introduces FlowMAC, a novel neural audio codec for high-quality general audio compression at low bit rates based on conditional flow matching (CFM). FlowMAC jointly learns a mel spectrogram encoder, quantizer and decoder. At inference time, the decoder integrates a continuous normalizing flow via an ODE solver to generate a high-quality mel spectrogram. This is the first application of a CFM-based approach to general audio coding, and it enables scalable, simple and memory-efficient training. Our subjective evaluations show that FlowMAC at 3 kbps achieves quality similar to that of state-of-the-art GAN-based and DDPM-based neural audio codecs at double the bit rate. Moreover, FlowMAC offers a tunable inference pipeline that allows trading off complexity against quality. This enables real-time coding on CPU while maintaining high perceptual quality.
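The tunable inference pipeline mentioned above can be illustrated with a minimal sketch of fixed-step ODE integration. This is not FlowMAC's decoder: the abstract does not state which solver is used, so a plain Euler scheme is assumed here, and `toy_vector_field` is a hypothetical stand-in for a trained, conditioned vector field network. The names `euler_sample`, `exact_toy_solution` and `max_abs_error` are likewise illustrative. The point is only that the number of solver steps is a knob: fewer steps mean fewer network evaluations, more steps mean a more accurate integration.

```python
import math


def toy_vector_field(x, t, cond):
    # Hypothetical stand-in for a learned network v_theta(x, t | cond).
    # It pulls the state toward the conditioning vector; t is unused here,
    # but a real network would depend on it.
    return [c - xi for xi, c in zip(x, cond)]


def euler_sample(x0, cond, num_steps, field=toy_vector_field):
    # Integrate dx/dt = field(x, t, cond) from t = 0 to t = 1
    # with num_steps fixed-size Euler steps (one field evaluation per step).
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        v = field(x, t, cond)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def exact_toy_solution(x0, cond):
    # Closed-form solution of dx/dt = cond - x at t = 1 (toy field only).
    decay = math.exp(-1.0)
    return [c + (xi - c) * decay for xi, c in zip(x0, cond)]


def max_abs_error(a, b):
    # Largest per-coordinate deviation between two vectors.
    return max(abs(p - q) for p, q in zip(a, b))


if __name__ == "__main__":
    x0 = [0.0, 0.0]      # starting point (noise sample in a real codec)
    cond = [1.0, 2.0]    # conditioning (decoded latent in a real codec)
    reference = exact_toy_solution(x0, cond)
    for steps in (1, 4, 32):
        err = max_abs_error(euler_sample(x0, cond, steps), reference)
        print(f"steps={steps:2d}  max abs error={err:.4f}")
```

In this toy setting the integration error shrinks as the step count grows, while the cost grows linearly with the number of field evaluations. That is the kind of complexity and quality trade-off the abstract refers to, under the stated assumptions.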