Recent advances in audio understanding tasks leverage the reasoning capabilities of LLMs. However, adapting LLMs to learn audio concepts requires massive training data and substantial computational resources. To address these challenges, Retrieval-Augmented Generation (RAG) retrieves audio-text pairs from a knowledge base (KB) and augments the query audio with them to generate accurate textual responses. In RAG, the relevance of the retrieved information plays a crucial role in effectively processing the input. In this paper, we analyze how different retrieval methods and knowledge bases impact the relevance of audio-text pairs and the performance of audio captioning with RAG. We propose generative pair-to-pair retrieval, which uses a generated caption as a text query to accurately find audio-text pairs relevant to the query audio, thereby improving the relevance and accuracy of the retrieved information. Additionally, we refine the large-scale knowledge base to retain only audio-text pairs that align with the contextualized intents. Our approach achieves state-of-the-art results on benchmarks including AudioCaps, Clotho, and Auto-ACD, with detailed ablation studies validating the effectiveness of our retrieval and KB construction methods.