LLM agents excel when environments are mostly static and the needed information fits in a model's context window, but they often fail in open-ended investigations where explanations must be constructed by iteratively mining evidence from massive, heterogeneous operational data. These investigations exhibit hidden dependency structure: entities interact, signals co-vary, and the importance of a fact may become clear only after other evidence is discovered. Because the context window is bounded, agents must summarize intermediate findings before their significance is known, increasing the risk of discarding key evidence. ReAct-style agents are especially brittle in this regime. Their retrieve-summarize-reason loop makes conclusions sensitive to exploration order and introduces run-to-run non-determinism, producing a reliability gap where Pass-at-k may be high but Majority-at-k remains low. Simply sampling more rollouts or generating longer reasoning traces does not reliably stabilize results, since hypotheses are not automatically re-checked as new evidence arrives and there is no explicit mechanism for belief bookkeeping and revision. In addition, ReAct entangles semantic reasoning with controller duties such as tool orchestration and state tracking, so execution errors and plan drift degrade reasoning while consuming scarce context. We address these issues by formulating investigation as abductive reasoning over a dependency graph and proposing EoG (Explanations over Graphs), a disaggregated framework in which an LLM performs bounded local evidence mining and labeling (cause vs. symptom) while a deterministic controller manages traversal, state, and belief propagation to compute a minimal explanatory frontier. On a representative ITBench diagnostics task, EoG improves both accuracy and run-to-run consistency over ReAct baselines, including a 7x average gain in Majority-at-k entity F1.
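The disaggregated control loop described above can be sketched in a few lines. This is a minimal illustration, not the EoG implementation: the function and entity names are hypothetical, and `label_node` stands in for the bounded LLM step that mines local evidence and labels an entity as cause or symptom, while the deterministic controller owns traversal, state, and the final frontier computation.

```python
from collections import deque

def label_node(node, evidence):
    # Placeholder for the bounded LLM labeling step (assumption):
    # in EoG this would query the model with local evidence only.
    return evidence.get(node, "symptom")

def explanatory_frontier(graph, roots, evidence):
    """Deterministic controller: BFS over the dependency graph,
    per-node labeling, then belief propagation to keep only causes
    that are not themselves explained by a deeper cause.

    graph: dict mapping node -> list of upstream dependencies
    roots: symptomatic entities where the investigation starts
    """
    labels = {}
    visited = set(roots)
    queue = deque(roots)
    while queue:                          # controller owns traversal/state
        node = queue.popleft()
        labels[node] = label_node(node, evidence)
        for dep in graph.get(node, []):   # expand along dependencies
            if dep not in visited:
                visited.add(dep)
                queue.append(dep)
    causes = {n for n, lab in labels.items() if lab == "cause"}
    # Belief propagation: a cause explained by a deeper cause is dropped,
    # leaving the minimal explanatory frontier.
    return {n for n in causes
            if not any(d in causes for d in graph.get(n, []))}

# Toy example: service A depends on B, which depends on C (a bad deploy).
graph = {"A": ["B"], "B": ["C"], "C": []}
evidence = {"A": "symptom", "B": "cause", "C": "cause"}
print(explanatory_frontier(graph, ["A"], evidence))  # -> {'C'}
```

Keeping the traversal and bookkeeping in plain code is what makes runs repeatable: only `label_node` is stochastic, so re-running the loop with the same labels always yields the same frontier.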