We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters that can generate up to 30 seconds of 44.1 kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models is the difficulty of creating preference pairs, because TTA lacks the structured mechanisms, such as verifiable rewards or gold-standard answers, that are available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. We open-source all code and models to support further research in TTA generation.
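To make the idea of ranking-based preference data concrete, the following is a minimal sketch of how one preference pair might be built from several candidate generations for a single prompt. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how pairs are selected, so taking the top-ranked candidate as "chosen" and the bottom-ranked as "rejected" is an assumption here. The names `PreferencePair`, `build_preference_pair`, and `toy_score` are hypothetical, candidates are represented by text labels instead of audio, and `toy_score` is a word-overlap stand-in for a real CLAP text-audio similarity score.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class PreferencePair:
    """One (prompt, chosen, rejected) training example (hypothetical type)."""
    prompt: str
    chosen: str
    rejected: str


def build_preference_pair(
    prompt: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> PreferencePair:
    """Rank candidates by score_fn and pair the best with the worst.

    Assumption: highest score -> chosen, lowest score -> rejected.
    In a real pipeline, score_fn would wrap a CLAP similarity model.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: score_fn(prompt, c), reverse=True)
    return PreferencePair(prompt=prompt, chosen=ranked[0], rejected=ranked[-1])


def toy_score(prompt: str, candidate: str) -> float:
    """Stand-in scorer: fraction of prompt words present in the candidate label."""
    prompt_words = set(prompt.lower().split())
    candidate_words = set(candidate.lower().split())
    return len(prompt_words & candidate_words) / max(len(prompt_words), 1)


if __name__ == "__main__":
    pair = build_preference_pair(
        "dog barking in rain",
        ["dog barking", "cat meowing", "dog barking in rain"],
        toy_score,
    )
    print(pair)
```

In an iterative setting, pairs built this way would be collected over many prompts, used for one round of preference optimization, and then regenerated from the updated model; that loop is omitted here because the abstract gives no further detail on it.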