Long-horizon omnimodal question answering requires reasoning jointly over text, images, audio, and video. Despite recent progress on OmniLLMs, low-resource long audio-video QA still suffers from costly dense encoding, weak fine-grained retrieval, limited proactive planning, and the absence of clear end-to-end optimization. To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning. It builds an image-audio retrieval-augmented generation module that lets an OmniLLM fetch short, relevant video frames and audio snippets from external banks. Moreover, it runs an agent loop that plans, calls tools across turns, and merges the retrieved evidence to answer complex queries. Furthermore, we apply group relative policy optimization to jointly improve tool use and answer quality over training. Experiments on OmniVideoBench, WorldSense, and Daily-Omni show that OmniRAG-Agent consistently outperforms prior methods under low-resource settings and achieves strong results, with ablations validating each component.
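The budgeted agent loop sketched above (plan, call a retrieval tool, merge evidence, answer) can be illustrated with a minimal toy. Everything here is a hypothetical stand-in, not the paper's actual API: the banks hold text proxies for encoded frames/audio, retrieval is crude word overlap instead of learned embeddings, and "planning" is a round-robin tool choice.

```python
# Hypothetical sketch of a budgeted agentic retrieval loop over external
# frame and audio banks. All names, banks, and scoring are illustrative.

FRAME_BANK = {   # timestamp -> frame caption (proxy for an encoded frame)
    3.0: "a chef slices onions on a wooden board",
    41.5: "a pot boils over on the stove",
}
AUDIO_BANK = {   # timestamp -> audio transcript snippet
    40.8: "hissing sound followed by 'turn the heat down!'",
    90.2: "soft background music",
}

def retrieve(bank, query, k=1):
    """Rank bank entries by word overlap with the query; return top-k."""
    words = set(query.lower().split())
    scored = sorted(bank.items(),
                    key=lambda kv: -len(words & set(kv[1].lower().split())))
    return scored[:k]

def agent_loop(question, budget=2):
    """Spend a fixed tool-call budget, alternating tools each turn."""
    evidence, tools = [], [("frame", FRAME_BANK), ("audio", AUDIO_BANK)]
    for turn in range(budget):                 # one tool call per turn
        name, bank = tools[turn % len(tools)]  # trivial round-robin "plan"
        for t, snippet in retrieve(bank, question):
            evidence.append((name, t, snippet))
    return evidence                            # merged evidence for answering

ev = agent_loop("why does the pot boil over and what sound plays?")
```

Under this toy scoring, the loop surfaces the boiling-pot frame and the temporally adjacent hissing audio snippet, which a downstream OmniLLM would then condition on; the real system replaces the overlap scorer with cross-modal retrieval and learns the tool-use policy via group relative policy optimization.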