Deployed reinforcement learning (RL) agents operate in closed-loop systems where reliable performance depends on maintaining coherent coupling between observations, actions, and outcomes. Current monitoring approaches rely on reward and task metrics, which are reactive by design and blind to the structural degradation that precedes performance collapse. We argue that deployment monitoring is fundamentally a question about uncertainty resolution: whether the agent's observations and actions continue to reduce uncertainty about outcomes, and whether outcomes constrain what the agent must have done. Information theory directly operationalizes this question: entropy quantifies uncertainty, and mutual information quantifies its resolution across the loop. We introduce Bipredictability (P), the fraction of the total uncertainty budget converted into shared predictability across the observation-action-outcome loop. A key theoretical property is a provable classical upper bound, P ≤ 0.5, that is independent of domain, task, or agent; it is a structural consequence of Shannon entropy rather than an empirical observation. When agency is present, a penalty suppresses P strictly below this ceiling, an effect confirmed at P = 0.33 across trained agents. To operationalize P as a real-time monitoring signal, we introduce the Information Digital Twin (IDT), an auxiliary architecture that computes P and its directional components from the observable interaction stream without access to model internals. Across 168 perturbation trials spanning eight perturbation types and two policy architectures, IDT-based monitoring detected 89.3% of coupling degradations versus 44.0% for reward-based monitoring, with 4.4 times lower median latency. These results establish Bipredictability as a principled, bounded, and computable prerequisite signal for closed-loop self-regulation in deployed reinforcement learning systems.
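To make the bound concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not state the formula for P, so the sketch assumes one plausible form, P = I(X;Y) / (H(X) + H(Y)), where X is the (observation, action) pair and Y is the outcome. Under that assumed form, the ceiling follows from the standard inequality I(X;Y) ≤ min(H(X), H(Y)) ≤ (H(X) + H(Y)) / 2. The function names `entropy` and `bipredictability` are hypothetical, and the plug-in estimator used here is only suitable for discrete symbols with enough samples.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in Shannon entropy, in bits, of a sequence of hashable symbols."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def bipredictability(xs, ys):
    """Assumed form (not given in the abstract): P = I(X;Y) / (H(X) + H(Y)).

    xs: sequence of discrete (observation, action) symbols.
    ys: sequence of discrete outcome symbols, aligned with xs.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    h_x = entropy(xs)
    h_y = entropy(ys)
    h_xy = entropy(list(zip(xs, ys)))  # joint entropy H(X, Y)
    total = h_x + h_y
    if total == 0:
        return 0.0  # no uncertainty to resolve
    # I(X;Y) = H(X) + H(Y) - H(X,Y); clamp tiny negative rounding errors.
    mutual_info = max(0.0, h_x + h_y - h_xy)
    return mutual_info / total


# Outcome fully determined by (and determining) the input: P reaches the ceiling.
coupled = bipredictability([0, 1, 2, 3] * 25, [0, 1, 2, 3] * 25)

# Outcome statistically independent of the input: P drops to zero.
decoupled = bipredictability([0, 0, 1, 1] * 25, [0, 1, 0, 1] * 25)
```

In this toy setting a perfectly coupled loop sits at the 0.5 ceiling and a fully decoupled loop sits at 0, which is the range a monitor would track. The sketch does not model the agency penalty or the directional components mentioned above.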