Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low-level acoustic details rather than semantically discriminative information, entangling event semantics and complicating the training of generative models. To address these issues, we discard VAE acoustic latents in favor of semantic encoder latents, and propose SemanticVocoder, a generative vocoder that synthesizes waveforms directly from semantic latents. Equipped with SemanticVocoder, our text-to-audio generation model achieves a Fréchet Distance of 12.823 and a Fréchet Audio Distance of 1.709 on the AudioCaps test set, as the introduced semantic latents exhibit superior discriminability compared to acoustic VAE latents. Beyond improved generation performance, SemanticVocoder also marks a promising step toward unifying audio understanding and generation within a shared semantic space. Generated samples are available at https://zeyuxie29.github.io/SemanticVocoder/.