Recent advances in audio understanding tasks leverage the reasoning capabilities of LLMs. However, adapting LLMs to learn audio concepts requires massive training data and substantial computational resources. To address these challenges, Retrieval-Augmented Generation (RAG) retrieves audio-text pairs from a knowledge base (KB) and augments the query audio with them to generate accurate textual responses. In RAG, the relevance of the retrieved information plays a crucial role in effectively processing the input. In this paper, we analyze how different retrieval methods and knowledge bases affect the relevance of audio-text pairs and the performance of RAG-based audio captioning. We propose generative pair-to-pair retrieval, which uses the generated caption as a text query to accurately find audio-text pairs relevant to the query audio, thereby improving the relevance and accuracy of the retrieved information. Additionally, we refine the large-scale knowledge base to retain only audio-text pairs that align with the contextualized intents. Our approach achieves state-of-the-art results on benchmarks including AudioCaps, Clotho, and Auto-ACD, with detailed ablation studies validating the effectiveness of our retrieval and KB construction methods.
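To make the retrieval step concrete, the following is a minimal sketch of generative pair-to-pair retrieval under assumed interfaces: `generate_caption`, `embed_text`, and the knowledge-base layout are hypothetical stand-ins for illustration, not the implementation described in the paper.

```python
# Illustrative sketch of generative pair-to-pair retrieval (assumptions noted above).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pair_to_pair_retrieve(query_audio, kb, generate_caption, embed_text, k=5):
    """Return the k audio-text pairs whose captions best match a draft caption
    generated for the query audio (text-to-text matching rather than audio-to-text)."""
    draft = generate_caption(query_audio)           # 1) draft caption for the query audio
    q = embed_text(draft)                           # 2) embed the draft caption as the text query
    scored = [(cosine(q, embed_text(text)), audio, text) for audio, text in kb]
    scored.sort(key=lambda s: s[0], reverse=True)   # 3) rank KB pairs by caption similarity
    return [(audio, text) for _, audio, text in scored[:k]]

# The retrieved pairs would then be supplied alongside the query audio as
# in-context examples for the captioning model, which produces the final caption.
```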