HIVE: Query, Hypothesize, Verify. An LLM Framework for Multimodal Reasoning-Intensive Retrieval

Multimodal retrieval models fail on reasoning-intensive queries in which images (diagrams, charts, screenshots) must be deeply integrated with text to identify relevant documents: the best multimodal model achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming even strong text-only retrievers (32.2). We introduce \textbf{HIVE} (\textbf{H}ypothesis-driven \textbf{I}terative \textbf{V}isual \textbf{E}vidence Retrieval), a plug-and-play framework that uses LLMs to inject explicit visual-text reasoning into a retriever. HIVE operates in four stages: (1) initial retrieval over the corpus, (2) LLM-based compensatory query synthesis that explicitly articulates the visual and logical gaps observed in the top-$k$ candidates, (3) secondary retrieval with the refined query, and (4) LLM verification and reranking over the union of candidates. Evaluated on the multimodal-to-text track of MM-BRIGHT (2,803 real-world queries across 29 technical domains), HIVE achieves a new state-of-the-art aggregated nDCG@10 of \textbf{41.7}, a \textbf{+9.5} point gain over the best text-only model (DiVeR: 32.2) and \textbf{+14.1} over the best multimodal model (Nomic-Vision: 27.6). Of this total, our reasoning-enhanced base retriever contributes 33.2 and the HIVE framework adds a further \textbf{+8.5} points. Results are particularly strong in visually demanding domains (Gaming: 68.2, Chemistry: 42.5, Sustainability: 49.4). Compatible with both standard and reasoning-enhanced retrievers, HIVE demonstrates that LLM-mediated visual hypothesis generation and verification can substantially close the multimodal reasoning gap in retrieval. Repository: https://github.com/mm-bright/multimodal-reasoning-retrieval
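
The four-stage control flow described above can be illustrated with a minimal sketch. It is not the authors' implementation. Everything named here is hypothetical: `overlap_retrieve` is a toy token-overlap scorer standing in for the base retriever, `stub_synthesize` and `stub_verify` stand in for the two LLM calls, and `image_description` is a text stand-in for the query image. Only the ordering of the stages and the reranking over the union of candidates follow the abstract.

```python
from collections import Counter
from typing import Callable, Dict, List

Corpus = Dict[str, str]  # doc_id -> document text


def overlap_retrieve(query: str, corpus: Corpus, k: int) -> List[str]:
    """Toy retriever (assumption): rank documents by shared-token count."""
    q = Counter(query.lower().split())

    def score(doc_id: str) -> int:
        d = Counter(corpus[doc_id].lower().split())
        return sum((q & d).values())  # multiset intersection size

    # Highest score first; ties broken by doc_id for determinism.
    return sorted(corpus, key=lambda doc_id: (-score(doc_id), doc_id))[:k]


def stub_synthesize(query: str, image_description: str,
                    candidates: List[str]) -> str:
    """Placeholder for the LLM query-synthesis call (assumption).

    A real system would prompt an LLM with the query, the image, and the
    candidates; this stub simply appends the image description.
    """
    return f"{query} {image_description}"


def stub_verify(query: str, image_description: str, doc_text: str) -> float:
    """Placeholder for the LLM verification call (assumption)."""
    terms = set(f"{query} {image_description}".lower().split())
    return float(len(terms & set(doc_text.lower().split())))


def hive_sketch(query: str, image_description: str, corpus: Corpus,
                synthesize: Callable[[str, str, List[str]], str],
                verify: Callable[[str, str, str], float],
                k: int = 2) -> List[str]:
    # Stage 1: initial retrieval over the corpus.
    first = overlap_retrieve(query, corpus, k)
    # Stage 2: compensatory query synthesis from the top-k candidates.
    refined = synthesize(query, image_description,
                         [corpus[d] for d in first])
    # Stage 3: secondary retrieval with the refined query.
    second = overlap_retrieve(refined, corpus, k)
    # Stage 4: verify and rerank the order-preserving union of candidates.
    union = list(dict.fromkeys(first + second))
    return sorted(union,
                  key=lambda d: -verify(query, image_description, corpus[d]))


corpus = {
    "d1": "why does my circuit overheat",
    "d2": "resistor color codes explained",
    "d3": "voltage divider with two resistors in series diagram",
}
query = "why is the output voltage wrong"
image_description = "diagram of two resistors in series"

ranking = hive_sketch(query, image_description, corpus,
                      stub_synthesize, stub_verify, k=1)
print(ranking)  # ['d3', 'd1']
```

In this toy run the first pass returns only `d1`, because the textual query alone ties `d1` and `d3`. The refined query surfaces `d3`, and the verification step then ranks it first. The example shows only why a second retrieval pass plus reranking over the union can recover a document that the first pass missed; it says nothing about the reported scores.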
