With recent advances in Large Audio-Language Models (LALMs), which demonstrate remarkable performance across a range of sound-, speech-, and music-related tasks, there is growing interest in benchmarks for assessing these models. However, existing benchmarks generally focus only on reasoning with internal knowledge, neglecting real-world scenarios that require grounding in external information. To bridge this gap, we introduce AudioRAG, a novel benchmark designed to evaluate audio-based reasoning augmented by information retrieval in realistic web environments. The benchmark comprises both LLM-generated and manually curated question-answer pairs. Our evaluations reveal that even state-of-the-art LALMs struggle to answer these questions. We therefore propose an agentic pipeline that integrates audio reasoning with retrieval-augmented generation, providing a stronger baseline for future research.