Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling demands fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent rates and slow encoding, and they require separate architectures for discrete vs. continuous latents and for different audio channel formats, complicating every stage from preprocessing to inference-time conditioning. We introduce a generative-first architecture for audio autoencoding that increases temporal downsampling from 2048x to 3360x and supports continuous and discrete representations, as well as common audio channel formats, in a single model. By balancing compression, quality, and speed, it delivers 10x faster encoding and 1.6x lower latent rates, and it eliminates channel-format-specific variants while maintaining competitive reconstruction quality. This enables applications previously constrained by processing costs: a 60-second mono signal compresses to 788 tokens, making generative modeling more tractable.
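The stated token count follows directly from the downsampling factor. A minimal sketch of the arithmetic, assuming a 44.1 kHz sample rate (the rate is an assumption; the 3360x factor and the 788-token figure are from the abstract):

```python
import math

# Assumed sample rate; the abstract does not state it explicitly.
SAMPLE_RATE = 44_100
# Temporal downsampling factor stated in the abstract.
DOWNSAMPLE = 3_360

def num_tokens(seconds: float) -> int:
    """Number of latent tokens for a mono clip of the given duration."""
    samples = seconds * SAMPLE_RATE
    # Each token covers DOWNSAMPLE consecutive samples; round up the tail.
    return math.ceil(samples / DOWNSAMPLE)

print(num_tokens(60))  # -> 788, matching the abstract
```

At 44.1 kHz, 60 seconds is 2,646,000 samples; dividing by 3360 gives 787.5, which rounds up to the 788 tokens quoted above.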