Eliciting truthful reports from autonomous agents is a core problem in scalable AI oversight: a principal scores the agent's report with a strictly proper scoring rule, but the agent also benefits from the report through a non-accuracy channel (approval for autonomous action, allocation share, downstream control). The same structure appears in classical mechanism-design settings such as marketplace operation. Our main result is an endogeneity result: the principal's optimal oversight necessarily uses a non-affine approval function to screen types, yet any non-affine approval makes truthful reporting suboptimal under the combined objective whenever deviation is undetectable. The principal therefore cannot avoid the perturbation that undermines calibration. The impossibility holds for every strictly proper scoring rule, and we give the perturbation in closed form. A constructive escape exists: a step-function approval threshold achieves first-best screening for every strictly proper scoring rule, because the agent's binary inflate-or-not choice induces a threshold in type space regardless of the curvature $G''$ of the rule's generator. Under the Brier score specifically, the type-independent inflation cost yields a welfare equivalence between second-best and first-best. We prove that this equivalence is unique to Brier: for every non-Brier rule, the welfare gap under smooth $C^1$ oversight is bounded below by $\Omega\big(\mathrm{Var}(1/G'')\,(\gamma/\beta)^2\big)$. We develop the framework in two instances: AI agent oversight (the lead motivating setting) and marketplace operation (a parallel mechanism-design domain). The message for AI alignment is direct: smooth scoring-based oversight cannot elicit truthful reports from a strategic agent, whereas sharp thresholds are the calibration-preserving design.
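As a rough numerical illustration of the claims above (not the paper's own construction), the sketch below assumes a binary event, an agent who maximizes `beta * expected_brier + gamma * approval`, and two hypothetical approval functions: a logistic curve and a step at `tau`. The additive form of the objective, the logistic shape, and the parameter values `beta`, `gamma`, `tau`, `k` are all assumptions made for illustration. Under these assumptions, the smooth approval pulls the optimal report above the true belief, while the step approval leaves the agent with a binary choice: report truthfully, or inflate exactly to `tau`.

```python
import math

def expected_brier(r, p):
    """Expected negated Brier loss of report r when the event has probability p."""
    return -(p * (1 - r) ** 2 + (1 - p) * r ** 2)

def smooth_approval(r, tau=0.7, k=10.0):
    """Hypothetical smooth (logistic) approval function, non-affine in r."""
    return 1.0 / (1.0 + math.exp(-k * (r - tau)))

def step_approval(r, tau=0.7):
    """Hypothetical step-function approval: granted iff the report reaches tau."""
    return 1.0 if r >= tau else 0.0

def best_report(p, approval, beta=1.0, gamma=0.2, n=10001):
    """Grid search for the report maximizing beta * score + gamma * approval."""
    grid = [i / (n - 1) for i in range(n)]
    return max(grid, key=lambda r: beta * expected_brier(r, p) + gamma * approval(r))

if __name__ == "__main__":
    for p in (0.1, 0.2, 0.3, 0.5, 0.9):
        r_smooth = best_report(p, smooth_approval)
        r_step = best_report(p, step_approval)
        print(f"p={p:.2f}  smooth: r*={r_smooth:.4f}  step: r*={r_step:.4f}")
```

In this toy setup, inflating from `p` to `tau` costs `beta * (tau - p) ** 2` in expected Brier score, an amount that depends only on the size of the deviation. An agent with `p < tau` therefore inflates exactly when `p >= tau - sqrt(gamma / beta)`, which is the kind of type-space cutoff the step-function design relies on.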