LLM agents excel when environments are mostly static and the needed information fits in a model's context window, but they often fail in open-ended investigations where explanations must be constructed by iteratively mining evidence from massive, heterogeneous operational data. These investigations exhibit hidden dependency structure: entities interact, signals co-vary, and the importance of a fact may only become clear after other evidence is discovered. Because the context window is bounded, agents must summarize intermediate findings before their significance is known, increasing the risk of discarding key evidence. ReAct-style agents are especially brittle in this regime. Their retrieve-summarize-reason loop makes conclusions sensitive to exploration order and introduces run-to-run non-determinism, producing a reliability gap: Pass-at-k may be high while Majority-at-k remains low. Simply sampling more rollouts or generating longer reasoning traces does not reliably stabilize results, since hypotheses cannot be autonomously checked as new evidence arrives and there is no explicit mechanism for belief bookkeeping and revision. In addition, ReAct entangles semantic reasoning with controller duties such as tool orchestration and state tracking, so execution errors and plan drift degrade reasoning while consuming scarce context. We address these issues by formulating investigation as abductive reasoning over a dependency graph and proposing EoG (Explanations over Graphs), a disaggregated framework in which an LLM performs bounded local evidence mining and labeling (cause vs. symptom) while a deterministic controller manages traversal, state, and belief propagation to compute a minimal explanatory frontier. On a representative ITBench diagnostics task, EoG improves both accuracy and run-to-run consistency over ReAct baselines, including a 7× average gain in Majority-at-k entity F1.
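The abstract only names EoG's ingredients; the disaggregation idea can be illustrated with a minimal sketch. Here a deterministic controller traverses a toy dependency graph upstream from a reported symptom and computes an explanatory frontier, while a static `LABELS` table stands in for the LLM's bounded cause-vs-symptom judgments. All names (`EDGES`, `LABELS`, `explanatory_frontier`) and the traversal details are illustrative assumptions, not the paper's actual implementation.

```python
from collections import deque

# Hypothetical dependency graph: edges point from cause to effect.
EDGES = {
    "disk-pressure": ["db-latency"],
    "db-latency": ["api-errors"],
    "api-errors": ["checkout-failures"],
}

# Stand-in for the LLM's per-node labeling; in EoG this would come from
# bounded local evidence mining over operational data.
LABELS = {
    "disk-pressure": "cause",
    "db-latency": "symptom",
    "api-errors": "symptom",
    "checkout-failures": "symptom",
}

def explanatory_frontier(edges, labels, start):
    """Deterministic controller: walk upstream from the reported symptom
    and return nodes labeled 'cause' with nothing further upstream, i.e.
    the frontier that explains the observed evidence."""
    # Build reverse adjacency: node -> its candidate causes.
    parents = {}
    for cause, effects in edges.items():
        for e in effects:
            parents.setdefault(e, []).append(cause)

    frontier, seen, queue = set(), {start}, deque([start])
    while queue:
        node = queue.popleft()
        ups = parents.get(node, [])
        if labels.get(node) == "cause" and not ups:
            frontier.add(node)  # root explanation: no upstream dependency
        for p in ups:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return frontier
```

Because traversal, state, and frontier computation live in ordinary deterministic code, repeated runs differ only in the LLM's local labels — which is the mechanism the abstract credits for the run-to-run consistency gains.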