In multimodal multi-hop question answering, we focus on the initial retrieval stage through two distinct tasks: (1) evidence set completion, which retrieves missing evidence given the available context, and (2) sequential pool construction, which iteratively builds the top-$K$ pool from scratch. In both settings, we observe that conventional iterative retrieval frameworks often suffer from Semantic Anchoring, where previously fetched evidence traps the retriever and yields entity-centric redundancy. To break this trap, we propose GRAIL (Gap-aware Retrieval via Adaptive Implicit Localization), a paradigm that performs implicit query rewriting directly at the embedding level. Context-subtractive query steering lets GRAIL excel at compositional cross-modal reasoning, whereas additive embedding updates are stronger at localized information aggregation. By dynamically routing queries based on task type, our Hybrid Framework achieves a 40.3% macro-averaged performance gain on MultimodalQA. Extensive evaluations show that sequential GRAIL is more resilient to noise and significantly expands the search horizon through iterative gap-aware optimization.
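To make the contrast between context-subtractive and additive query updates concrete, the following is a minimal sketch and not the GRAIL implementation. The abstract does not specify the update rule, so everything here is an assumption made for illustration: the function names (`steer_query`, `top_k`), the step size `alpha`, the use of the mean of the already-retrieved embeddings as the context vector, and cosine-similarity ranking. The toy vectors are hand-picked, not experimental data.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steer_query(query, context, alpha=0.5, mode="subtractive"):
    """Hypothetical embedding-level query update (assumed form, not from the paper).

    subtractive: q' = normalize(q - alpha * mean(context))
    additive:    q' = normalize(q + alpha * mean(context))
    """
    if mode not in ("subtractive", "additive"):
        raise ValueError("mode must be 'subtractive' or 'additive'")
    q = normalize(query)
    if not context:  # nothing retrieved yet: leave the query as it is
        return q
    c = mean_vector([normalize(v) for v in context])
    sign = -1.0 if mode == "subtractive" else 1.0
    return normalize([qi + sign * alpha * ci for qi, ci in zip(q, c)])

def top_k(query, candidates, k=1):
    """Indices of the k candidates with the highest cosine similarity to the query."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, normalize(c))) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: -scores[i])[:k]

# Toy setup: the question needs two aspects (axes 0 and 1).
query = [1.0, 1.0, 0.0]
context = [[1.0, 0.0, 0.0]]      # evidence already retrieved (covers axis 0)
candidates = [
    [1.0, 0.1, 0.0],             # index 0: near-duplicate of the context
    [0.0, 1.0, 0.0],             # index 1: the missing aspect
]

plain = top_k(query, candidates)
subtractive = top_k(steer_query(query, context, mode="subtractive"), candidates)
additive = top_k(steer_query(query, context, mode="additive"), candidates)
print(plain, subtractive, additive)  # [0] [1] [0]
```

In this toy case the unmodified query and the additive update both rank the near-duplicate first, which mirrors the anchoring behavior described above, while the subtractive update shifts the query toward the aspect that the retrieved context does not yet cover.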