Recent years have witnessed remarkable progress in Text-to-Audio Generation (TTA), providing sound creators with powerful tools to transform inspiration into vivid audio. Yet despite these advances, current TTA systems often suffer from slow inference, which greatly hinders the efficiency and smoothness of audio creation. In this paper, we present MeanAudio, a fast and faithful text-to-audio generator capable of rendering realistic sound with only one function evaluation (1-NFE). MeanAudio leverages: (i) the MeanFlow objective with a guided velocity target, which significantly accelerates inference; (ii) an enhanced Flux-style transformer with dual text encoders for better semantic alignment and synthesis quality; and (iii) an efficient instantaneous-to-mean curriculum that speeds up convergence and enables training on consumer-grade GPUs. Through a comprehensive evaluation, we demonstrate that MeanAudio achieves state-of-the-art performance in single-step audio generation. Specifically, it reaches a real-time factor (RTF) of 0.013 on a single NVIDIA RTX 3090, a 100x speedup over SOTA diffusion-based TTA systems. Moreover, MeanAudio also performs strongly in multi-step generation, enabling smooth transitions across successive synthesis steps.
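As background for the 1-NFE claim, the following is a minimal sketch of the MeanFlow formulation from the flow-matching literature: the average velocity over an interval, the identity it satisfies, and the resulting one-step sampler. This is an assumed background formulation for illustration only; MeanAudio's guided velocity target may differ in its details.

% Sketch of the MeanFlow formulation (background assumption; the exact
% guided-velocity target used by MeanAudio may differ).
% Average velocity over [r, t] induced by the instantaneous velocity v:
\[
  u(z_t, r, t) \;=\; \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau) \,\mathrm{d}\tau .
\]
% Differentiating (t - r)\,u(z_t, r, t) with respect to t yields the
% MeanFlow identity commonly used as a training target:
\[
  u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t - r)\,\frac{\mathrm{d}}{\mathrm{d}t}\, u(z_t, r, t) .
\]
% With a network u_\theta approximating u, a single function evaluation
% (1-NFE) maps noise z_1 directly to a sample z_0:
\[
  z_0 \;=\; z_1 \;-\; u_\theta(z_1, 0, 1) .
\]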