The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

Agentic artificial intelligence (AI) in organizations is a sequential decision problem constrained by reliability and oversight cost. When deterministic workflows are replaced by stochastic policies over actions and tool calls, the key question is not whether a next step appears plausible, but whether the resulting trajectory remains statistically supported, locally unambiguous, and economically governable. We develop a measure-theoretic Markov framework for this setting. The core quantities are state blind-spot mass B_n(tau), state-action blind mass B^SA_{pi,n}(tau), an entropy-based human-in-the-loop escalation gate, and an expected oversight-cost identity over the workflow visitation measure. We instantiate the framework on the Business Process Intelligence Challenge 2019 purchase-to-pay log (251,734 cases, 1,595,923 events, 42 distinct workflow actions) and construct a log-driven simulated agent from a chronological 80/20 split of the same process. The main empirical finding is that a large workflow can appear well supported at the state level while retaining substantial blind mass over next-step decisions: refining the operational state to include case context, economic magnitude, and actor class expands the state space from 42 to 668 states and raises state-action blind mass, which reaches 0.0165 at tau=50 and 0.1253 at tau=1000. On the held-out split, the top-action probability m(s) = max_a pi-hat(a|s) tracks realized autonomous step accuracy to within 3.4 percentage points on average. The same quantities that delimit statistically credible autonomy also determine the expected oversight burden. Although demonstrated here on a large-scale enterprise procurement workflow, the framework is designed for direct application to any engineering process for which operational event logs are available.
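
The quantities named above can be illustrated with a minimal sketch. The abstract does not give formal definitions, so the following rests on assumptions: blind mass is taken to be the share of empirical visitation mass that falls on states (or state-action pairs) observed fewer than tau times, pi-hat is taken to be the empirical next-action frequency in the log, and the escalation gate is assumed to compare the Shannon entropy of pi-hat(.|s) against a threshold. The function names and the toy log are hypothetical, and the paper's B^SA_{pi,n}(tau) may weight pairs by the agent policy pi rather than by log frequencies as done here.

```python
from collections import Counter
from math import log

def blind_mass(items, tau):
    """Share of observations whose item was seen fewer than tau times.

    Assumed reading of B_n(tau) when items are states, and of the
    state-action analogue when items are (state, action) pairs.
    """
    counts = Counter(items)
    total = sum(counts.values())
    rare = sum(c for c in counts.values() if c < tau)
    return rare / total if total else 0.0

def empirical_policy(steps):
    """Map each state to its empirical next-action distribution pi-hat(.|s)."""
    pair_counts = Counter(steps)
    state_counts = Counter(s for s, _ in steps)
    policy = {}
    for (s, a), c in pair_counts.items():
        policy.setdefault(s, {})[a] = c / state_counts[s]
    return policy

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution given as a dict."""
    return -sum(p * log(p) for p in dist.values() if p > 0)

# Hypothetical toy log of (state, action) steps, not taken from BPI 2019.
steps = [("A", "x")] * 6 + [("A", "y")] * 2 + [("B", "x")] + [("B", "z")]

state_blind = blind_mass([s for s, _ in steps], tau=3)   # only state B is rare
pair_blind = blind_mass(steps, tau=3)                    # three pairs are rare
policy = empirical_policy(steps)
m = {s: max(dist.values()) for s, dist in policy.items()}  # m(s) = max_a pi-hat(a|s)
escalate = {s: entropy(dist) > 0.6 for s, dist in policy.items()}  # 0.6 is an arbitrary threshold
```

In this toy log the state-level blind mass is smaller than the state-action blind mass at the same tau, which mirrors the qualitative pattern the abstract reports: a state can be well supported overall while some of its next-step decisions are rarely observed.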
