Reasoning Provenance for Autonomous AI Agents: Structured Behavioral Analytics Beyond State Checkpoints and Execution Traces

As AI agents transition from human-supervised copilots to autonomous platform infrastructure, the ability to analyze their reasoning behavior across populations of investigations becomes a pressing infrastructure requirement. Existing operational tooling addresses adjacent needs effectively: state checkpoint systems enable fault tolerance, observability platforms provide execution traces for debugging, and telemetry standards ensure interoperability. What current systems do not natively provide, as a first-class, schema-level primitive, is structured reasoning provenance: normalized, queryable records of why the agent chose each action, what it concluded from each observation, how each conclusion shaped its strategy, and which evidence supports its final verdict. This paper introduces the Agent Execution Record (AER), a structured reasoning provenance primitive that captures intent, observation, and inference as first-class queryable fields on every step, alongside versioned plans with revision rationale, evidence chains, structured verdicts with confidence scores, and delegation authority chains. We formalize the distinction between computational state persistence and reasoning provenance, argue that the latter cannot in general be faithfully reconstructed from the former, and show how AERs enable population-level behavioral analytics: reasoning pattern mining, confidence calibration, cross-agent comparison, and counterfactual regression testing via mock replay. We present a domain-agnostic model with extensible domain profiles, a reference implementation and SDK, and an evaluation methodology informed by preliminary deployment on a production, platformized root cause analysis agent.
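To make the shape of such a record concrete, the following is a minimal sketch of how an AER might be laid out, together with one population-level query (a simple confidence calibration gap). Every class, field, and function name here is an illustrative assumption rather than the paper's schema or SDK; the `action` field and the `outcomes` ground-truth list in particular are hypothetical additions used only to make the example self-contained.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One reasoning step; all field names are illustrative assumptions."""
    intent: str       # why the agent chose this action
    action: str       # what it did (hypothetical field, not named in the abstract)
    observation: str  # what the action returned
    inference: str    # what the agent concluded from the observation


@dataclass
class PlanVersion:
    """A versioned plan with the rationale for revising it."""
    version: int
    planned_actions: List[str]
    revision_rationale: str = ""


@dataclass
class Verdict:
    """A structured verdict with a confidence score and its evidence chain."""
    conclusion: str
    confidence: float             # stated confidence in [0, 1]
    evidence_step_ids: List[int]  # indices into AgentExecutionRecord.steps


@dataclass
class AgentExecutionRecord:
    """Hypothetical container for one investigation."""
    agent_id: str
    delegation_chain: List[str]   # who authorized the agent, outermost first
    steps: List[Step] = field(default_factory=list)
    plans: List[PlanVersion] = field(default_factory=list)
    verdict: Optional[Verdict] = None


def calibration_gap(records: List[AgentExecutionRecord],
                    outcomes: List[bool]) -> float:
    """Mean stated confidence minus observed accuracy over a population.

    `outcomes[i]` is an assumed ground-truth flag saying whether the verdict
    of `records[i]` was correct. A positive result suggests overconfidence.
    Records without a verdict are skipped.
    """
    scored = [(r.verdict.confidence, ok)
              for r, ok in zip(records, outcomes)
              if r.verdict is not None]
    if not scored:
        return 0.0
    mean_confidence = sum(c for c, _ in scored) / len(scored)
    accuracy = sum(1 for _, ok in scored if ok) / len(scored)
    return mean_confidence - accuracy


if __name__ == "__main__":
    record = AgentExecutionRecord(
        agent_id="rca-agent",
        delegation_chain=["on-call-engineer", "rca-agent"],
        steps=[Step(intent="check recent deploys",
                    action="list_deploys",
                    observation="one deploy shortly before the alert",
                    inference="the deploy is a plausible cause")],
        plans=[PlanVersion(version=1, planned_actions=["list_deploys"])],
        verdict=Verdict(conclusion="regression from latest deploy",
                        confidence=0.9, evidence_step_ids=[0]),
    )
    print(calibration_gap([record], [True]))
```

Because intent, observation, and inference are separate typed fields in this sketch rather than free text inside a trace span, a query such as `calibration_gap` can run over many records without parsing logs, which is the kind of population-level analysis the abstract describes.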
