Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Because our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing full-length outputs for short sounds. We also support inpainting, which enables targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent space. Finally, we apply adversarial post-training, which reduces the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data and generate music and sounds in less than 2 s on an H200 GPU and within a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipeline.