Monitoring Agentic Systems Before They're Reliable

arXiv preprint. 9 pages, 2 figures, 3 tables. Accepted to the Workshop on Agentic Software Engineering (AgenticSE), co-located with ACM CAIS 2026 (non-archival).

Agentic systems entering production typically operate as partially integrated assemblies where structural defects, not task-level errors, dominate the failure landscape. At this maturity level, task-level error detection may be infeasible: structural failure modes mask the signal that task-level monitors are designed to detect.

We present a monitoring and triage methodology that decomposes agentic system evaluation into three dimensions (quality, suitability, efficiency) at three monitoring scopes (within-run, cross-run, structural), using variance as a characterization signal. Findings are routed through severity classification adapted from FMEA, concentrating human attention on the subset that warrants investigation. We evaluate on a synthetic testbed of 220 runs across 120 document bundles with controlled error injection.

Three results emerge. Monitor scope determines failure type: within-run monitors surface deterministic stage defects (CV = 0.02), cross-run monitors surface stochastic integration consequences (CV = 1.25, 24% at L2), and a structural monitor identifies an integration gap with perfect consistency (CV = 0.00). Injected task-level errors are indistinguishable from clean baselines, confirming that structural defects mask task-level signal. Deterministic triage routes 97% of findings to automated tracking, leaving the 2% reflecting variable behavior for human investigation.

We propose, on Stage 1 evidence, a maturity-staging model in which monitoring transitions from structural characterization to error detection to reliability tracking as integration defects resolve. The taxonomy, CV-based scope characterization, and severity model transfer architecturally to document-driven, multi-stage agentic workflows in regulated industries; specific calibrations are domain-specific. Deploy monitoring early: the first thing it finds is the most important thing to fix.
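
To make the variance-based characterization concrete, the following is a minimal sketch of how a coefficient of variation (CV, standard deviation divided by mean) could be computed per monitor and used to separate consistent findings from variable ones. It is not the paper's implementation: the function names `coefficient_of_variation` and `route_finding`, the 0.5 threshold, and the sample counts are all hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by |mean|.

    Returns 0.0 when all values are identical (perfectly consistent signal)
    and infinity when the values vary around a mean of zero.
    """
    spread = pstdev(values)
    if spread == 0:
        return 0.0
    center = mean(values)
    if center == 0:
        return float("inf")
    return spread / abs(center)


def route_finding(cv, threshold=0.5):
    """Route a finding by how variable its monitor signal is.

    ASSUMPTION: the 0.5 threshold and the two route names are illustrative;
    the abstract does not state the actual triage rule or calibration.
    """
    return "automated_tracking" if cv < threshold else "human_investigation"


# Hypothetical per-run finding counts for three monitor scopes (not paper data).
observations = {
    "within_run": [50, 51, 49, 50],     # nearly constant: deterministic defect
    "cross_run": [0, 0, 5, 0, 1, 9],    # bursty: stochastic consequence
    "structural": [1, 1, 1, 1],         # identical every run
}

for scope, counts in observations.items():
    cv = coefficient_of_variation(counts)
    print(f"{scope}: CV={cv:.2f} -> {route_finding(cv)}")
```

Under these assumptions, a low CV marks a finding that recurs the same way on every run and can be tracked automatically, while a high CV marks behavior that changes between runs and is therefore a better use of human attention.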
