Diffusion-based audio generation systems have been studied from many angles, including data space, network architecture, and conditioning techniques, but most of these innovations require re-training the model. At sampling time, classifier-free guidance (CFG) is widely adopted to improve generation quality by strengthening alignment with the condition. However, CFG often reduces diversity, which leads to suboptimal performance. The recent autoguidance (AG) method offers a different guidance direction that preserves diversity, but its direct application to audio generation has so far underperformed CFG. In this work, we introduce AudioMoG, an improved sampling method that enhances text-to-audio (T2A) and video-to-audio (V2A) generation quality without requiring extensive training resources. We first analyze CFG and AG, examining their respective advantages and limitations for guiding diffusion models. Building on these insights, we introduce a mixture-of-guidance framework that integrates diverse guidance signals together with their interaction terms (e.g., the unconditional bad version of the model) to maximize their cumulative advantages. Experiments show that, at the same inference speed, our approach consistently outperforms single guidance in T2A generation across sampling steps, while also showing advantages in V2A, text-to-music, and image generation. Demo samples are available at: https://audiomog.github.io.
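To make the two guidance directions discussed above concrete, the following is a minimal sketch, not the AudioMoG implementation. The `cfg` and `autoguidance` functions follow the standard extrapolation forms of the two methods: CFG pushes the conditional prediction away from the unconditional one, while AG pushes it away from the prediction of a weaker ("bad") conditional model. The function `naive_mixture`, its weights `w_cfg` and `w_ag`, and the plain-list representation of model outputs are all assumptions made for illustration; the abstract does not specify how AudioMoG combines guidance signals or their interaction terms, so this additive combination should be read as a hypothetical baseline only.

```python
from typing import List, Sequence


def cfg(cond: Sequence[float], uncond: Sequence[float], w: float) -> List[float]:
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one with scale w (w = 1 is no guidance)."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


def autoguidance(cond: Sequence[float], bad_cond: Sequence[float], w: float) -> List[float]:
    """Autoguidance: extrapolate from a weaker ("bad") conditional model's
    prediction toward the main model's conditional prediction."""
    return [b + w * (c - b) for c, b in zip(cond, bad_cond)]


def naive_mixture(
    cond: Sequence[float],
    uncond: Sequence[float],
    bad_cond: Sequence[float],
    w_cfg: float,
    w_ag: float,
) -> List[float]:
    """HYPOTHETICAL additive mix of both directions (an assumption, not the
    paper's formulation). It reduces to CFG when w_ag = 1 and to AG when
    w_cfg = 1."""
    return [
        c + (w_cfg - 1.0) * (c - u) + (w_ag - 1.0) * (c - b)
        for c, u, b in zip(cond, uncond, bad_cond)
    ]


# Toy predictions standing in for one denoising step's model outputs.
cond = [1.0, 2.0]      # main model, with condition
uncond = [0.0, 1.0]    # main model, without condition
bad_cond = [0.5, 1.5]  # weaker model, with condition

cfg_out = cfg(cond, uncond, w=3.0)
ag_out = autoguidance(cond, bad_cond, w=2.0)
mix_out = naive_mixture(cond, uncond, bad_cond, w_cfg=3.0, w_ag=2.0)
```

In a real sampler these lists would be tensors produced by the denoiser at each step, and every additional guidance term costs an extra forward pass, which is why comparisons at equal inference speed matter.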