Neural audio compression models have recently achieved extreme compression rates, enabling efficient latent generative modeling. Conversely, latent generative models have been applied to compression, pushing the limits of continuous and discrete approaches. However, existing methods remain constrained to low-resolution audio and degrade substantially at very low bitrates, where audible artifacts are prominent. In this paper, we present S-PRESSO, a 48 kHz sound effect compression model that produces both continuous and discrete embeddings at ultra-low bitrates, down to 0.096 kbps, via offline quantization. Our model relies on a pretrained latent diffusion model to decode compressed audio embeddings learned by a latent encoder. Leveraging the generative priors of the diffusion decoder, we achieve extremely low frame rates, down to 1 Hz (750× compression rate), producing convincing and realistic reconstructions at the cost of exact fidelity. Despite operating at high compression rates, we demonstrate that S-PRESSO outperforms both continuous and discrete baselines in audio quality, acoustic similarity, and reconstruction metrics.
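To give a sense of scale for the figures quoted above, the following is a minimal back-of-the-envelope sketch rather than anything taken from the S-PRESSO implementation. It assumes, for illustration only, that the lowest bitrate (0.096 kbps, i.e. 96 bits per second) coincides with the lowest frame rate (1 Hz), and that the uncompressed reference is 16-bit mono PCM; neither assumption is stated in the abstract. The function names are hypothetical.

```python
def bits_per_frame(bitrate_bps: int, frame_rate_hz: int) -> float:
    """Bit budget available to each latent frame."""
    return bitrate_bps / frame_rate_hz


def pcm_bitrate_bps(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Bitrate of uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * channels


# Figures from the abstract: 0.096 kbps = 96 bits per second, 1 Hz frame rate.
# ASSUMPTION: both minima are reached by the same configuration.
budget = bits_per_frame(bitrate_bps=96, frame_rate_hz=1)

# ASSUMPTION: 16-bit mono PCM reference (bit depth and channel count are not
# given in the abstract).
raw = pcm_bitrate_bps(sample_rate_hz=48_000, bit_depth=16, channels=1)

# Bitrate reduction relative to the assumed PCM reference.
reduction = raw / 96

print(f"bits per frame: {budget:.0f}")
print(f"raw PCM bitrate: {raw} bps")
print(f"bitrate reduction vs. assumed PCM: {reduction:.0f}x")
```

The ratio computed here is a bitrate ratio against an assumed PCM reference. It is not the same quantity as the 750× compression rate reported in the abstract, whose exact definition the abstract does not give.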