Universal Multimodal Retrieval (UMR) seeks any-to-any search across text and vision, yet modern embedding models remain brittle when queries require latent reasoning (e.g., resolving underspecified references or matching compositional constraints). We argue this brittleness is often data-induced: when images carry "silent" evidence and queries leave key semantics implicit, a single embedding pass must both reason and compress, encouraging spurious feature matching. We propose a data-centric framework that decouples these roles by externalizing reasoning before retrieval. Using a strong Vision--Language Model, we make implicit semantics explicit by densely captioning visual evidence in corpus entries, resolving ambiguous multimodal references in queries, and rewriting verbose instructions into concise retrieval constraints. Inference-time enhancement alone is insufficient; the retriever must be trained on these semantically dense representations to avoid distribution shift and fully exploit the added signal. Across M-BEIR, our reasoning-augmented training method yields consistent gains over strong baselines, with ablations showing that corpus enhancement chiefly benefits knowledge-intensive queries while query enhancement is critical for compositional modification requests. We publicly release our code at https://github.com/AugmentedRetrieval/ReasoningAugmentedRetrieval.
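The enhancement step described above can be sketched as a small preprocessing pass that runs before embedding. This is a minimal illustration, not the paper's implementation: `vlm_dense_caption` and `vlm_resolve_query` are hypothetical stand-ins for calls to a Vision--Language Model, and the string-concatenation scheme is an assumption about how the externalized reasoning is merged into the retrievable text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorpusEntry:
    """A multimodal corpus entry: an image plus (possibly sparse) text."""
    image_id: str
    text: str


def vlm_dense_caption(image_id: str) -> str:
    # Hypothetical stand-in for a VLM call that densely captions
    # the "silent" visual evidence in an image.
    return f"[dense caption of {image_id}]"


def vlm_resolve_query(query_text: str, query_image_id: Optional[str]) -> str:
    # Hypothetical stand-in for a VLM call that resolves ambiguous
    # multimodal references and rewrites verbose instructions into
    # concise retrieval constraints.
    if query_image_id is None:
        return query_text
    return query_text.replace("this item", vlm_dense_caption(query_image_id))


def enhance_corpus_entry(entry: CorpusEntry) -> str:
    # Externalize visual evidence as text so the embedding pass
    # only has to compress, not reason.
    return f"{entry.text} {vlm_dense_caption(entry.image_id)}".strip()


def enhance_query(query_text: str, query_image_id: Optional[str] = None) -> str:
    # Make implicit query semantics explicit before retrieval.
    return vlm_resolve_query(query_text, query_image_id)
```

As the abstract stresses, running this enhancement only at inference time is insufficient: the retriever would see semantically dense inputs it was never trained on, so the same pass must also be applied to the training data.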