It is an open challenge to obtain high-quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged \textit{text-only language models} to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an \textit{audio language model} to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named \texttt{AF-AudioSet}, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find that leveraging our pipeline and synthetic captions leads to significant improvements in audio generation quality, achieving a new \textit{state-of-the-art}.