When does an LLM controller outperform rule-based traversal for knowledge graph exploration? We study this question through RLM-on-KG, a retrieval system that treats an LLM as an autonomous navigator over an RDF-encoded mention graph for grounded question answering. Unlike GraphRAG pipelines that rely on offline LLM indexing, RLM-on-KG performs entity-first, multi-hop exploration at query time using deterministic graph construction and a fixed tool set. Our central finding is a conditional advantage: the value of LLM control depends on how scattered the evidence is and on the controller's tool-calling sophistication. The core claim concerns LLM control versus heuristic traversal; it is not a generic win over GraphRAG. On GraphRAG-Bench Novel (519 questions), Gemini 2.0 Flash achieves +2.47 pp F1 over a rule-based heuristic baseline (p < 0.0001), but only +0.16 pp over a GraphRAG-local variant (not significant). With a stronger controller, Claude Haiku 4.5, the gain over the heuristic baseline grows to +4.37 pp (p < 0.001) and extends to a significant +2.42 pp improvement over GraphRAG-local (p < 0.001). The gain is largest when gold evidence is scattered across 6-10 chunks (+3.21 pp) and smallest when evidence is concentrated (+1.85 pp). Cross-scale validation on MuSiQue confirms that the LLM-over-heuristic advantage transfers, with the expected attenuation on smaller per-question graphs. The core architectural insight is the separation of candidate discovery from ranking: the LLM adds value through exploration breadth, while final evidence selection is best handled by pure vector re-ranking. Beyond retrieval, exploration traces offer a proposed stress-test harness for structured data quality, yielding diagnostics for coverage, connectivity, provenance, and queryability.
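The separation of candidate discovery from ranking described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `graph` layout (entity mapped to its chunks and neighbors), the `choose_next` policy callable standing in for either the LLM controller or the heuristic, and the toy embedding vectors. The sketch only shows the division of labor: the controller decides where to explore, and the final selection uses vector similarity alone.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def discover(graph, seeds, choose_next, max_steps):
    """Candidate discovery: walk the mention graph and collect chunk ids.

    `choose_next` is the pluggable controller (assumption: an LLM or a
    rule-based heuristic); it picks one entity from the current frontier.
    """
    visited, candidates = [], []
    frontier = list(seeds)
    for _ in range(max_steps):
        # Drop entities that were already expanded.
        frontier = [e for e in frontier if e not in visited]
        if not frontier:
            break
        entity = choose_next(frontier)
        visited.append(entity)
        candidates.extend(graph[entity]["chunks"])
        frontier.extend(graph[entity]["neighbors"])
    return candidates


def rerank(query_vec, candidates, chunk_vecs, k):
    """Ranking: order discovered chunks purely by similarity to the query."""
    unique = list(dict.fromkeys(candidates))  # de-duplicate, keep first-seen order
    ranked = sorted(unique, key=lambda c: cosine(query_vec, chunk_vecs[c]), reverse=True)
    return ranked[:k]


# Toy data (hypothetical): two entities that mention two chunks.
toy_graph = {
    "A": {"chunks": ["c1"], "neighbors": ["B"]},
    "B": {"chunks": ["c2", "c1"], "neighbors": ["A"]},
}
toy_vecs = {"c1": [1.0, 0.0], "c2": [0.0, 1.0]}

found = discover(toy_graph, ["A"], lambda frontier: frontier[0], max_steps=5)
top = rerank([0.0, 1.0], found, toy_vecs, k=1)
```

Swapping the `choose_next` callable changes which candidates are discovered without touching `rerank`, which is the point of keeping the two stages separate in this sketch.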