Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling requires fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent rates, slow encoding, and separate architectures for discrete vs. continuous latents and for different audio channel formats, complicating workflows from preprocessing to inference-time conditioning. We introduce a generative-first architecture for audio autoencoding that increases temporal downsampling from 2048x to 3360x and supports continuous and discrete representations and common audio channel formats in one model. By balancing compression, quality, and speed, it delivers 10x faster encoding and 1.6x lower latent rates, and eliminates channel-format-specific variants while maintaining competitive reconstruction quality. This enables applications previously constrained by processing costs: a 60-second mono signal compresses to 788 tokens, making generative modeling more tractable.
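The token count above follows directly from the downsampling factor. A minimal sketch of that arithmetic, assuming a 44.1 kHz sample rate (the rate is not stated in the text; `num_tokens` is an illustrative helper, not part of the described system):

```python
import math

def num_tokens(duration_s: float, sample_rate: int = 44_100, downsampling: int = 3360) -> int:
    """Latent tokens for a mono signal under a given temporal downsampling factor."""
    samples = duration_s * sample_rate          # raw audio samples
    return math.ceil(samples / downsampling)    # one token per downsampled frame

# 60 s mono at 3360x downsampling: 2,646,000 samples / 3360 = 787.5 -> 788 tokens
print(num_tokens(60))  # -> 788
```

Under the same assumption, the quoted 1.6x rate reduction matches the ratio of downsampling factors, 3360 / 2048 ≈ 1.64.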