Text-to-audio (TTA) generation produces audio from a text description, with models trained on pairs of audio samples and hand-annotated text. Commercializing audio generation is challenging, however, because user-input prompts are often under-specified compared with the text descriptions used to train TTA models. In this work, we treat TTA models as a ``black box'' and address the user prompt challenge with two key insights: (1) user prompts are generally under-specified, leading to a large alignment gap between user prompts and training prompts; (2) there is a distribution of audio descriptions for which TTA models generate higher-quality audio, which we refer to as ``audionese''. To this end, we rewrite prompts with instruction-tuned models and propose using text-audio alignment as a feedback signal, via margin ranking learning, to improve the generated audio. In both objective and subjective human evaluations, we observe marked improvements in text-audio alignment and music audio quality.
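As a rough illustration of the margin ranking idea mentioned in the abstract, the sketch below shows how a hinge-style ranking loss behaves on pairs of alignment scores. It is not the authors' formulation: the function names, the margin value of 0.1, and the example scores are all hypothetical, and the scores are assumed to come from some text-audio alignment scorer that the abstract does not specify.

```python
def margin_ranking_loss(score_preferred, score_other, margin=0.1):
    """Hinge-style ranking loss for one pair of scores.

    The loss is zero once the preferred candidate's score exceeds the
    other candidate's score by at least `margin` (an assumed value).
    """
    return max(0.0, margin - (score_preferred - score_other))


def batch_loss(pairs, margin=0.1):
    """Mean margin ranking loss over (preferred, other) score pairs."""
    return sum(margin_ranking_loss(p, o, margin) for p, o in pairs) / len(pairs)


# Hypothetical scores for two rewrites of the same user prompt; the first
# entry of each pair is the rewrite judged to align better with its audio.
pairs = [(0.62, 0.41), (0.35, 0.33), (0.20, 0.48)]
print(batch_loss(pairs))
```

In this sketch, the first pair contributes no loss because the preferred rewrite already leads by more than the margin, while the third pair is penalized most because the ordering is inverted.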