Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.