GRASP: Graph Agentic Search over Propositions for Multi-hop Question Answering

Agentic retrieval improves multi-hop question answering by giving language models autonomy to iteratively gather evidence. Recent work augments these systems with knowledge graphs for structured traversal, but this combination introduces significant costs: expensive graph construction at index time and compounding token usage at inference time. We introduce Graph Agentic Search over Propositions (GRASP), an agentic system that simultaneously optimizes for high accuracy and minimal token usage in multi-hop question answering. Rather than executing a single, rigid query, GRASP actively coordinates its retrieval strategy by decomposing multi-hop queries into dependency-aware plans. This enables GRASP to dynamically scale the number of sub-agents according to the complexity of the problem. Each sub-agent resolves its single-hop query by exploring a novel three-layer hierarchical graph of entities, propositions, and passages, using the entity layer for targeted traversal and the proposition layer for high-recall passage retrieval via reciprocal-rank voting. We evaluate GRASP on MuSiQue, 2WikiMultihopQA, and HotpotQA under two settings: open-corpus retrieval and extended-context reasoning (LongBench). In the open-corpus retrieval setting, GRASP achieves the highest QA accuracy on MuSiQue and 2Wiki while using 40-50 percent fewer tokens than IRCoT+HippoRAG2. Furthermore, GRASP leads on EM and F1 across all three datasets in the LongBench setting while using 30 percent fewer tokens than the next most accurate method. Finally, we introduce success economy, the amortized token cost per correct answer weighted by difficulty, and advocate for efficiency-aware evaluation as a standard practice for agentic QA.
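
To make the proposition-to-passage step more concrete, the following is a minimal sketch of one way reciprocal-rank voting could work. The abstract does not specify the scoring rule, so everything here is an assumption for illustration: the function name `reciprocal_rank_vote`, the smoothing constant `k` (borrowed from the common reciprocal rank fusion default of 60), the `top_n` cutoff, and the idea that each retrieved proposition casts a vote of 1 / (k + rank) for the passage it was extracted from.

```python
from collections import defaultdict


def reciprocal_rank_vote(ranked_propositions, proposition_to_passage, k=60, top_n=3):
    """Aggregate a ranked list of propositions into a ranked list of passages.

    Hypothetical scoring rule (not taken from the paper): a proposition at
    1-based rank r adds 1 / (k + r) to the score of its source passage.
    """
    scores = defaultdict(float)
    for rank, prop_id in enumerate(ranked_propositions, start=1):
        passage_id = proposition_to_passage[prop_id]
        scores[passage_id] += 1.0 / (k + rank)

    # Sort by descending score; break ties by passage id for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [passage_id for passage_id, _ in ranked[:top_n]]


# Toy example: passage "B" is supported by two mid-ranked propositions,
# so it can outrank passage "A", which holds only the top proposition.
ranked_props = ["p1", "p2", "p3", "p4"]
source_passage = {"p1": "A", "p2": "B", "p3": "B", "p4": "C"}
top_passages = reciprocal_rank_vote(ranked_props, source_passage)
```

Under this assumed rule, a passage backed by several relevant propositions accumulates votes, which is one plausible reading of how the proposition layer could favor recall at the passage level.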
