Audio codecs power discrete music generative modelling, music streaming, and immersive media by compressing PCM audio to bandwidth-friendly bit-rates. Recent work has gravitated toward processing in the spectral domain; however, spectrogram-domain models typically struggle with phase modeling, since phase is naturally complex-valued. Most frequency-domain neural codecs either discard phase information or encode it as two separate real-valued channels, limiting spatial fidelity. This inadequate representation of the audio signal forces such codecs to compensate with adversarial discriminators, at the expense of convergence speed and training stability. In this work we introduce an end-to-end complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling across the entire analysis-quantization-synthesis pipeline and removes adversarial discriminators and diffusion post-filters. Without GANs or diffusion, we match or surpass much longer-trained baselines in-domain and reach SOTA out-of-domain performance. Compared to standard baselines that train for hundreds of thousands of steps, our model reduces the training budget by an order of magnitude, making it markedly more compute-efficient while preserving high perceptual quality.
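To make the core idea concrete, the sketch below shows residual vector quantization (RVQ) applied to complex STFT frames whose real and imaginary parts are quantized jointly, rather than phase being discarded or split off. This is a minimal illustration only, not the paper's implementation: the codebook sizes, frame counts, and greedy nearest-neighbour search are all illustrative assumptions.

```python
# Minimal RVQ sketch over complex spectral frames (illustrative, not the
# paper's architecture). Real and imaginary parts are stacked into one
# feature vector per frame so each quantization stage sees magnitude and
# phase jointly, preserving their coupling through quantization.
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Greedy RVQ: each stage quantizes the residual left by earlier stages."""
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:
        residual = x - quantized
        # nearest codeword per frame (Euclidean distance in stacked re/im space)
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        quantized = quantized + cb[idx]
    return codes, quantized

# Toy complex spectrogram: 8 frames x 16 frequency bins.
spec = rng.standard_normal((8, 16)) + 1j * rng.standard_normal((8, 16))
feat = np.concatenate([spec.real, spec.imag], axis=1)  # joint re/im features

# Four residual stages, 64 codewords each (hypothetical sizes).
codebooks = [rng.standard_normal((64, 32)) * 0.5 for _ in range(4)]
codes, q = rvq_encode(feat, codebooks)

# Decoding maps the stacked features back to a complex spectrogram.
rec = q[:, :16] + 1j * q[:, 16:]
```

Each stage transmits only an index per frame, so the bit-rate is `stages * log2(codebook_size)` bits per frame here; the decoder rebuilds the complex spectrum from the summed codewords and inverts the STFT.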