As AI agents transition from human-supervised copilots to autonomous platform infrastructure, the ability to analyze their reasoning behavior across populations of investigations becomes a pressing requirement. Existing operational tooling addresses adjacent needs effectively: state checkpoint systems enable fault tolerance, observability platforms provide execution traces for debugging, and telemetry standards ensure interoperability. What current systems do not provide as a first-class, schema-level primitive is structured reasoning provenance: normalized, queryable records of why the agent chose each action, what it concluded from each observation, how each conclusion shaped its strategy, and which evidence supports its final verdict. This paper introduces the Agent Execution Record (AER), a structured reasoning provenance primitive that captures intent, observation, and inference as first-class queryable fields on every step, alongside versioned plans with revision rationale, evidence chains, structured verdicts with confidence scores, and delegation authority chains. We formalize the distinction between computational state persistence and reasoning provenance, argue that the latter cannot in general be faithfully reconstructed from the former, and show how AERs enable population-level behavioral analytics: reasoning pattern mining, confidence calibration, cross-agent comparison, and counterfactual regression testing via mock replay. We present a domain-agnostic model with extensible domain profiles, a reference implementation and SDK, and an evaluation methodology informed by preliminary deployment on a platformized root cause analysis agent running in production.
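To make the idea of per-step reasoning fields concrete, the following is a minimal Python sketch of what such a record and one population-level query might look like. It is not the paper's reference implementation or SDK: the class names (`Step`, `Verdict`, `ExecutionRecord`), their fields, and the `calibration_gap` helper are all hypothetical, chosen only to mirror the concepts named above (intent, observation, inference, verdict confidence, evidence links).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One agent step; reasoning fields are plain, queryable attributes (hypothetical schema)."""
    action: str
    intent: str        # why the agent chose this action
    observation: str   # what the action returned
    inference: str     # what the agent concluded from the observation


@dataclass
class Verdict:
    conclusion: str
    confidence: float          # agent-reported confidence in [0, 1]
    evidence_steps: List[int]  # indices of the steps that support the conclusion


@dataclass
class ExecutionRecord:
    agent_id: str
    steps: List[Step] = field(default_factory=list)
    verdict: Optional[Verdict] = None


def calibration_gap(records: List[ExecutionRecord], outcomes: List[bool]) -> float:
    """Mean reported confidence minus observed accuracy across a population.

    outcomes[i] says whether records[i]'s verdict was later judged correct.
    A positive result suggests overconfidence, a negative one underconfidence.
    """
    scored = [
        (rec.verdict.confidence, ok)
        for rec, ok in zip(records, outcomes)
        if rec.verdict is not None
    ]
    if not scored:
        raise ValueError("no records with a verdict")
    mean_confidence = sum(conf for conf, _ in scored) / len(scored)
    accuracy = sum(1 for _, ok in scored if ok) / len(scored)
    return mean_confidence - accuracy


# Two illustrative records (made-up data, for demonstration only).
records = [
    ExecutionRecord(
        agent_id="rca-agent",
        steps=[Step("query_logs", "check for recent errors", "timeout spikes", "database is slow")],
        verdict=Verdict("database saturation", 0.9, [0]),
    ),
    ExecutionRecord(
        agent_id="rca-agent",
        steps=[Step("read_deploys", "look for recent changes", "no deploys", "not a release issue")],
        verdict=Verdict("network partition", 0.7, [0]),
    ),
]
gap = calibration_gap(records, outcomes=[True, False])
```

Because intent, observation, and inference are ordinary fields rather than text buried in a trace, a query such as the calibration gap above needs only a pass over the records. The same shape would support the other analytics listed, under whatever schema the actual AER specification defines.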