Superintelligent Retrieval Agent: The Next Frontier of Agentic Retrieval

Retrieval-augmented agents are increasingly the interface to large knowledge bases, yet most treat retrieval as a black box: they issue exploratory queries, inspect snippets, and reformulate until evidence emerges. This resembles a newcomer searching an unfamiliar database rather than an expert navigating it with strong priors about terminology and likely evidence, and it costs extra retrieval rounds, added latency, and poor recall. We introduce the \textit{Superintelligent Retrieval Agent} (SIRA), which casts \emph{superintelligence} in retrieval as compressing multi-round exploratory search into a single corpus-discriminative retrieval action. SIRA does not merely ask which terms are relevant; it asks which terms separate the desired evidence from corpus-level confusers. Offline, an LLM enriches each document with missing search vocabulary. At query time, the LLM predicts evidence vocabulary that the query omits, and corpus statistics serve as tool calls that filter out terms that are absent from the corpus, overly common, or unlikely to create retrieval margin. The final step is a single weighted BM25 call combining the query with the validated expansion. Across ten BEIR benchmarks, SIRA achieves the strongest average retrieval performance in our comparison, beating dense retrievers, learned sparse retrievers, and LLM search-agent baselines while using no relevance labels or retriever fine-tuning. On downstream QA, its retrieval-only answer coverage exceeds that of recent RL-trained agentic QA systems on NQ and HotpotQA. We also introduce \textbf{BrowseComp-Wikipedia}, a hard-search benchmark of 232 BrowseComp-derived queries over a 25,587,229-document Wikipedia index. Even without index-time enrichment, using only grounded Wikipedia categories, SIRA outperforms multi-round Perplexity agents at every budget, reaching 9.70% Recall@1, 15.27% Recall@10, and 36.14% Recall@100.
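The query-time step described above (filter candidate expansion terms with corpus statistics, then issue one weighted BM25 call) can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, the `max_df_ratio` threshold, the 0.5 expansion weight, the Lucene-style IDF formula, and every function name are assumptions made for illustration. The LLM that proposes candidate terms is replaced by a hard-coded list, and the "retrieval margin" filter is not modeled.

```python
import math
from collections import Counter


def build_index(docs):
    """Tokenize on whitespace and collect document frequencies (toy index)."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    avgdl = sum(len(toks) for toks in tokenized) / len(tokenized)
    return tokenized, df, avgdl


def filter_expansion(candidates, df, n_docs, max_df_ratio=0.5):
    """Drop candidates that are absent from the corpus or overly common.

    max_df_ratio is a hypothetical threshold, not a value from the paper.
    """
    return [t for t in candidates if 0 < df.get(t, 0) <= max_df_ratio * n_docs]


def weighted_bm25(weights, tokenized, df, avgdl, k1=1.2, b=0.75):
    """Score every document with BM25, scaling each term by its weight."""
    n_docs = len(tokenized)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term, weight in weights.items():
            if tf[term] == 0:
                continue  # term does not occur in this document
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(toks) / avgdl
            score += weight * idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "aspirin reduces fever and pain",
    "acetylsalicylic acid inhibits cyclooxygenase and reduces inflammation",
    "the fever dream album reviews and ratings",
    "ibuprofen and paracetamol dosage guidance",
]
tokenized, df, avgdl = build_index(docs)

query_terms = ["aspirin", "mechanism"]
# Stand-in for LLM-predicted evidence vocabulary (assumed, not generated here).
candidates = ["cyclooxygenase", "and", "prostaglandin"]

kept = filter_expansion(candidates, df, n_docs=len(docs))

# Original query terms get weight 1.0; validated expansion terms get 0.5 (assumed).
weights = {term: 1.0 for term in query_terms}
weights.update({term: 0.5 for term in kept})

scores = weighted_bm25(weights, tokenized, df, avgdl)
```

In this toy corpus, "and" is discarded because it occurs in every document and "prostaglandin" is discarded because it occurs in none, so only "cyclooxygenase" survives. The second document, which never mentions "aspirin", then receives a nonzero score from the single weighted call.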
