With the development of AI-Generated Content (AIGC), text-to-audio models are gaining widespread attention. However, it is challenging for these models to generate audio aligned with human preference, owing to the inherent information density of natural language and the models' limited understanding ability. To alleviate this issue, we propose BATON, a framework designed to enhance the alignment between generated audio and text prompts using human preference feedback. BATON comprises three key stages. First, we curate a dataset of prompts and their corresponding generated audio, annotated with human feedback. Second, we introduce a reward model, built on the constructed dataset, that mimics human preference by assigning rewards to input text-audio pairs. Finally, we use the reward model to fine-tune an off-the-shelf text-to-audio model. Experimental results demonstrate that BATON significantly improves the generation quality of the original text-to-audio models in terms of audio integrity, temporal relationship, and alignment with human preference.
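To make the three stages concrete, the following is a minimal toy sketch, not the BATON implementation. The abstract does not specify the annotation scheme, the reward model architecture, or the fine-tuning objective, so everything here is an assumption made for illustration: binary human labels, 2-D stand-in features for each text-audio pair, a logistic reward model, and a reward-weighted average of per-sample generation losses. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_reward(weights, bias, features):
    """Return a reward in (0, 1) for one text-audio pair's feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train_reward_model(dataset, epochs=200, lr=0.5):
    """Fit a logistic reward model on (features, human_label) pairs with SGD."""
    dim = len(dataset[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for features, label in dataset:
            # Gradient of binary cross-entropy w.r.t. the logit is (p - y).
            error = predict_reward(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def reward_weighted_loss(per_sample_losses, rewards):
    """Average generation losses, weighting each sample by its reward."""
    total = sum(rewards)
    return sum(l * r for l, r in zip(per_sample_losses, rewards)) / total


# Stage 1 (toy stand-in): hypothetical 2-D features for text-audio pairs,
# labeled 1 if annotators judged the audio to match the prompt, else 0.
rng = random.Random(0)
dataset = []
for _ in range(100):
    label = rng.randint(0, 1)
    center = 1.0 if label else -1.0
    features = [center + rng.gauss(0, 0.3), center + rng.gauss(0, 0.3)]
    dataset.append((features, label))

# Stage 2: fit the reward model on the annotated pairs.
weights, bias = train_reward_model(dataset)

# Stage 3: use predicted rewards to weight the generator's training losses.
# The per-sample losses 0.2 and 0.9 are made-up placeholder values.
good_pair, bad_pair = [1.0, 1.0], [-1.0, -1.0]
rewards = [
    predict_reward(weights, bias, good_pair),
    predict_reward(weights, bias, bad_pair),
]
loss = reward_weighted_loss([0.2, 0.9], rewards)
```

In this sketch, a pair the reward model scores highly dominates the weighted loss, so fine-tuning would be steered toward samples that resemble human-preferred outputs. A real system would replace the toy features with learned text and audio representations and the placeholder losses with the generator's actual training loss.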