We prove the Behavioral Credibility Trilemma: whenever some tasks exceed the agent's reliable competence, no reinforcement learning policy with confidence-gated autonomy can simultaneously achieve maximum helpfulness, optimal calibration, and full autonomy under rational oversight. The impossibility is geometric: adding any non-affine autonomy incentive to a strictly proper scoring rule destroys strict properness, so an agent rewarded for both calibrated confidence and autonomous action systematically inflates its reported confidence on tasks below the principal's approval threshold. The Behavioral Perturbation Lemma quantifies the inflation (scaling as $w_A/(2 w_C)$ for the Brier score) and shows that detecting it requires $\Omega(1/\Delta^2)$ observations. We prove that the principal's optimal oversight rule is necessarily non-affine, which makes the impossibility unconditional and optimizer-independent across log-concave-density policy families. We formalize the Confidence-Gated Decision Problem, map existing methods onto the trilemma, and identify two constructive resolution pathways: commitment and domain separation. A 540-configuration Best-of-N experiment tests five pre-registered hypotheses, all strongly confirmed (effect sizes $d = 1.10$ to $5.32$). We also give a descriptive analysis of the geometry of the achievable $(H, C, A)$ surface, which shows a plateau-truncated frontier consistent with the predicted saturation of the inflation.
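To make the inflation mechanism concrete, the following minimal sketch maximizes a Brier reward plus a smooth autonomy bonus over the reported confidence $q$. Everything beyond the Brier term is an assumption made for illustration: we read $w_C$ and $w_A$ as the calibration and autonomy reward weights, model the principal's approval gate as a sigmoid with hypothetical threshold `tau` and steepness `k`, and read $\Delta$ as the inflation gap to be detected. Under these assumptions the first-order condition gives $q^* - p = w_A\, g'(q^*)/(2 w_C)$, which reduces to $w_A/(2 w_C)$ where the gate has unit slope. The sample-size helper uses the standard Hoeffding bound, a sufficient count of order $1/\Delta^2$ that matches the order of the stated lower bound but is not the paper's own argument.

```python
import math


def gate(q, tau=0.7, k=4.0):
    """Assumed smooth approval gate g(q): a sigmoid centred at threshold tau."""
    return 1.0 / (1.0 + math.exp(-k * (q - tau)))


def gate_slope(q, tau=0.7, k=4.0):
    """Derivative g'(q) of the assumed sigmoid gate."""
    s = gate(q, tau, k)
    return k * s * (1.0 - s)


def expected_reward(q, p, w_c, w_a, tau=0.7, k=4.0):
    """Expected reward for reporting q when the true success probability is p."""
    # Expected (negated) Brier score: strictly proper, maximized at q = p.
    brier = -(p * (1.0 - q) ** 2 + (1.0 - p) * q ** 2)
    # Non-affine autonomy incentive (illustrative assumption).
    return w_c * brier + w_a * gate(q, tau, k)


def optimal_report(p, w_c, w_a, tau=0.7, k=4.0, grid=100_001):
    """Grid search for the reward-maximizing report on [0, 1]."""
    qs = (i / (grid - 1) for i in range(grid))
    return max(qs, key=lambda q: expected_reward(q, p, w_c, w_a, tau, k))


def hoeffding_samples(gap, fail_prob=0.05):
    """Observations sufficient to detect a mean shift of size gap in [0, 1]-valued
    outcomes with failure probability fail_prob (two-sided Hoeffding bound)."""
    return math.ceil(math.log(2.0 / fail_prob) / (2.0 * gap ** 2))


if __name__ == "__main__":
    p, w_c, w_a = 0.6, 1.0, 0.2
    q_star = optimal_report(p, w_c, w_a)
    print("inflation:", q_star - p)
    print("first-order prediction:", w_a * gate_slope(q_star) / (2.0 * w_c))
    print("samples for gap 0.1:", hoeffding_samples(0.1))
```

With $w_A = 0$ the optimum returns to $q^* = p$, which is strict properness of the Brier score; any positive $w_A$ shifts the report upward by an amount governed by the ratio $w_A/(2 w_C)$ and the local slope of the gate.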