Text-to-Audio (TTA) aims to generate audio that corresponds to a given text description and plays a crucial role in media production. However, the text descriptions in TTA datasets lack variation and diversity, so TTA model performance drops when the input text is complex. To address this issue, we propose the Portable Plug-in Prompt Refiner, which draws on the rich knowledge of textual descriptions inherent in large language models to improve the robustness of TTA acoustic models without altering the acoustic training set. Furthermore, we introduce a Chain-of-Thought procedure that mimics human verification to improve the accuracy of audio descriptions, and thereby the accuracy of generated content in practical applications. Experiments show that our method achieves a state-of-the-art Inception Score (IS) of 8.72, surpassing AudioGen, AudioLDM, and Tango.