Multimodal retrieval systems struggle to resolve image-text queries against text-only corpora: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers (the best of which reaches 32.2). We argue that the bottleneck is not the retriever but the query: raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present \textbf{BRIDGE}, a two-component system that resolves this mismatch without multimodal encoders. \textbf{FORGE} (\textbf{F}ocused Retrieval Query Generato\textbf{r}) is a query alignment model, trained via reinforcement learning, that distills noisy multimodal queries into compact, retrieval-optimized search strings. \textbf{LENS} (\textbf{L}anguage-\textbf{E}nhanced \textbf{N}eural \textbf{S}earch) is a reasoning-enhanced dense retriever fine-tuned on reasoning-intensive retrieval data to handle the intent-rich queries that FORGE produces. Evaluated on MM-BRIGHT (2,803 queries, 29 domains), BRIDGE achieves \textbf{29.7} nDCG@10, surpassing all multimodal encoder baselines, including Nomic-Vision (27.6). When FORGE is applied as a plug-and-play aligner on top of Nomic-Vision, the combined system reaches \textbf{33.3} nDCG@10, exceeding the best text-only retriever (32.2) and demonstrating that \textit{query alignment} is the key bottleneck in multimodal-to-text retrieval. Project repository: https://github.com/mm-bright/multimodal-reasoning-retrieval
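All results above are reported in nDCG@10. For readers unfamiliar with the metric, the following is a minimal sketch of how a per-query nDCG@10 could be computed. It assumes the linear-gain form of DCG and graded relevance labels; the function names `dcg_at_k` and `ndcg_at_k` are illustrative and are not taken from the BRIDGE codebase, and the exact evaluation script used for MM-BRIGHT may differ. Scores such as 27.6 are presumably this quantity averaged over queries and scaled by 100.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels, in rank order.

    Assumption: linear gain (rel / log2(rank + 1)), with ranks starting at 1.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k for one query.

    ranked_relevances: labels of the retrieved documents, in retrieved order.
    all_relevances: labels of every judged document for the query, used to
        build the ideal ranking.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents: define the score as zero
    return dcg_at_k(ranked_relevances, k) / ideal


# Hypothetical query with two relevant documents, retrieved at ranks 2 and 4.
score = ndcg_at_k([0, 1, 0, 1], [1, 1, 0, 0], k=10)
print(round(score, 4))
```

Because the score is normalized by the ideal ranking, placing both relevant documents at ranks 1 and 2 would give 1.0, while pushing them further down the list lowers the score logarithmically.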