Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling demands fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent rates and slow encoding, and they require separate architectures for discrete vs. continuous latents and for different audio channel formats, complicating every stage from preprocessing to inference-time conditioning. We introduce a generative-first architecture for audio autoencoding that increases temporal downsampling from 2048x to 3360x and supports continuous and discrete representations, as well as common audio channel formats, in a single model. By balancing compression, quality, and speed, it delivers 10x faster encoding and 1.6x lower latent rates, and it eliminates channel-format-specific variants while maintaining competitive reconstruction quality. This enables applications previously constrained by processing costs: a 60-second mono signal compresses to 788 tokens, making generative modeling more tractable.
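The stated token count follows directly from the downsampling factor. A minimal sketch of the arithmetic, assuming a 44.1 kHz sample rate (the rate is an assumption; the 3360x factor and the 788-token figure are from the abstract):

```python
import math

# Assumed sample rate; the abstract does not state it explicitly.
SAMPLE_RATE = 44_100
# Temporal downsampling factor stated in the abstract.
DOWNSAMPLE = 3_360

def num_tokens(seconds: float) -> int:
    """Number of latent tokens for a mono clip of the given duration."""
    samples = seconds * SAMPLE_RATE
    # Each token covers DOWNSAMPLE consecutive samples; round up the tail.
    return math.ceil(samples / DOWNSAMPLE)

print(num_tokens(60))  # -> 788, matching the abstract
```

At 44.1 kHz, 60 seconds is 2,646,000 samples; dividing by 3360 gives 787.5, which rounds up to the 788 tokens quoted above.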