RLM-on-KG: Heuristics First, LLMs When Needed: Adaptive Retrieval Control over Mention Graphs for Scattered Evidence

When does an LLM controller outperform rule-based traversal for knowledge graph exploration? We study this question through RLM-on-KG, a retrieval system that treats an LLM as an autonomous navigator over an RDF-encoded mention graph for grounded question answering. Unlike GraphRAG pipelines that rely on offline LLM indexing, RLM-on-KG performs entity-first, multi-hop exploration at query time using deterministic graph construction and a fixed tool set. Our central finding is a conditional advantage: the value of LLM control depends on evidence scatter and on tool-calling sophistication. The paper's core claim concerns LLM control versus heuristic traversal, not a generic win over GraphRAG. On GraphRAG-Bench Novel (519 questions), Gemini 2.0 Flash achieves +2.47 pp F1 over a rule-based heuristic baseline (p < 0.0001), but only +0.16 pp over a GraphRAG-local variant (not significant). With a stronger controller, Claude Haiku 4.5, the gain over the heuristic grows to +4.37 pp (p < 0.001) and extends to a significant +2.42 pp improvement over GraphRAG-local (p < 0.001). The gain is largest when gold evidence is scattered across 6-10 chunks (+3.21 pp) and smallest for concentrated evidence (+1.85 pp). Cross-scale validation on MuSiQue confirms that the LLM-over-heuristic advantage transfers, with expected attenuation on smaller per-question graphs. The core architectural insight is the separation of candidate discovery from ranking: the LLM adds value through exploration breadth, while final evidence selection is best handled by pure vector re-ranking. Beyond retrieval, we propose exploration traces as a stress-test harness for structured data quality, yielding diagnostics for coverage, connectivity, provenance, and queryability.
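The separation of candidate discovery from ranking can be illustrated with a minimal sketch. It is not the paper's implementation: the names `discover`, `rerank`, and `retrieve`, the toy two-dimensional embeddings, and the choice of cosine similarity are all assumptions made for illustration. The point it shows is only that the controller (LLM or heuristic) contributes a candidate pool, while the final selection uses vector similarity alone.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(query_vec: Vector, candidates: Dict[str, Vector], k: int) -> List[str]:
    """Pure vector re-ranking: order candidate chunk ids by similarity to the query."""
    ordered = sorted(
        candidates,
        key=lambda cid: cosine(query_vec, candidates[cid]),
        reverse=True,
    )
    return ordered[:k]


def retrieve(
    query_vec: Vector,
    discover: Callable[[], List[str]],  # hypothetical controller: LLM-driven or rule-based
    embeddings: Dict[str, Vector],
    k: int = 2,
) -> List[str]:
    """Discovery decides WHICH chunks are considered; ranking decides their ORDER."""
    # Deduplicate discovered ids while preserving discovery order.
    pool = {cid: embeddings[cid] for cid in dict.fromkeys(discover()) if cid in embeddings}
    # The controller's visiting order plays no role in the final selection.
    return rerank(query_vec, pool, k)


# Toy data (assumed): three chunks with 2-d embeddings.
toy_embeddings = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [0.9, 0.1]}
# A stand-in controller that visits chunks in an arbitrary order, with a repeat.
toy_discover = lambda: ["c2", "c3", "c1", "c3"]
top = retrieve([1.0, 0.0], toy_discover, toy_embeddings, k=2)
```

Under this framing, a broader exploration policy can only change the result by enlarging or altering the candidate pool, which is consistent with the claim that the LLM's contribution lies in exploration breadth rather than in ranking.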
