We propose a general feedback-driven retrieval-augmented generation (RAG) approach that uses Large Audio Language Models (LALMs) to address missing or imperfectly synthesized sound events in text-to-audio (TTA) generation. Unlike previous RAG-based TTA methods, which typically train specialized models from scratch, we use LALMs to analyze generated audio, retrieve from an external database the concepts that pre-trained models struggle to generate, and incorporate the retrieved information into the generation process. Experimental results show that our method not only improves the ability of LALMs to identify missing sound events but also delivers improvements across different models, outperforming existing RAG-specialized approaches.
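The analyze, retrieve, and regenerate steps described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: `feedback_rag`, `Reference`, `Audio`, the two callables, the dictionary used as the external database, and the `max_rounds` cap are all hypothetical. The toy stand-ins replace the real TTA model and the LALM with set operations so that only the control flow is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Reference:
    """Hypothetical database entry: an example clip for one sound event."""
    event: str
    clip_id: str


@dataclass
class Audio:
    """Toy stand-in for a generated clip; `events` replaces the real waveform."""
    events: Set[str]


def feedback_rag(
    prompt_events: List[str],
    generate: Callable[[List[str], List[Reference]], Audio],
    find_missing: Callable[[Audio, List[str]], List[str]],
    database: Dict[str, Reference],
    max_rounds: int = 2,  # assumption: the abstract does not specify a round count
) -> Tuple[Audio, List[Reference]]:
    """Generate, ask the analyzer what is missing, retrieve, and regenerate."""
    references: List[Reference] = []
    audio = generate(prompt_events, references)
    for _ in range(max_rounds):
        # Feedback step: the analyzer (an LALM in the paper) lists missing events.
        missing = find_missing(audio, prompt_events)
        # Retrieval step: look up references that have not been used yet.
        retrieved = [
            database[e]
            for e in missing
            if e in database and database[e] not in references
        ]
        if not retrieved:
            break  # nothing missing, or nothing new to retrieve
        references.extend(retrieved)
        # Regeneration step: condition the generator on the retrieved references.
        audio = generate(prompt_events, references)
    return audio, references


# Toy stand-ins (assumptions): the "model" only knows one event unaided.
KNOWN_EVENTS = {"dog barking"}


def toy_generate(prompt_events: List[str], references: List[Reference]) -> Audio:
    guided = {r.event for r in references}
    return Audio({e for e in prompt_events if e in KNOWN_EVENTS or e in guided})


def toy_find_missing(audio: Audio, prompt_events: List[str]) -> List[str]:
    return [e for e in prompt_events if e not in audio.events]


database = {"theremin": Reference("theremin", "ref_001")}
audio, used = feedback_rag(
    ["dog barking", "theremin"], toy_generate, toy_find_missing, database
)
print(sorted(audio.events))
print([r.clip_id for r in used])
```

In this sketch retrieval is triggered only by the analyzer's feedback, so a prompt the generator already handles costs no lookups. How the retrieved information is actually injected into a pre-trained TTA model is not stated in the abstract and is left abstract here.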