Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses, and iterate. This process misses patterns that emerge only across trace populations, and it does not scale to production corpora in which individual traces span tens of thousands of tokens. We formalize the problem of corpus-level trace diagnostics: given a corpus of execution traces, the goal is to produce grounded natural-language insights that characterize systematic behavioral patterns across trace groups, each linked to supporting evidence. We present the Insights Generator (IG), a multi-agent system that answers diagnostic questions by proposing and testing hypotheses across the trace corpus and compiling the results into an evidence-backed insights report. We evaluate IG along qualitative and objective dimensions, spanning rubric-based report assessment and the downstream performance improvements achieved by implementing IG insights. Human experts using IG reports improve scaffold performance by 30.4 percentage points over the unmodified baseline scaffold, and coding agents leveraging IG-derived insights show consistent and stable gains. Across benchmarks, IG's scout-investigator architecture produces findings comparable in detection coverage to those of competing approaches, while domain experts rate IG reports as leading in depth and evidence quality.