Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans maintain meaningful oversight of systems that may exceed their own capabilities? Existing approaches to scalable oversight rely on complex assumptions, remain largely heuristic, or lack practical methods for sequential settings with statistical guarantees. We introduce Calibrated Collective Oversight (CCO), which aggregates diverse auxiliary scoring functions into a penalty measuring deviation from a conservative baseline. Inspired by Attainable Utility Preservation, CCO enables collective conservatism: actions face a penalty proportional to overseer concern, so high-utility actions are still selected when overseers find them unobjectionable and overridden only when concern accumulates. CCO calibrates this conservatism online using Conformal Decision Theory, ensuring that undesirable outcomes remain below a user-specified target threshold with finite-time bounds and no distributional assumptions. On a modified version of SWE-bench, weaker overseers successfully constrain an adversarially misaligned stronger agent; on MACHIAVELLI, CCO substantially reduces ethical violations while preserving reward. In both settings, empirical violation rates closely match the specified targets, as predicted by the theory.
翻译:具备自主规划与扩展环境交互能力的智能体AI系统引发了一个根本性控制问题:人类如何对可能超越自身能力的系统维持有意义的监督?现有可扩展监督方法依赖于复杂假设、大多停留在启发式层面,或缺乏具备统计保证的序列化场景实用方案。我们提出校准式集体监督(Calibrated Collective Oversight, CCO),该方法将多样化的辅助评分函数聚合为惩罚项,用于度量对保守基线的偏离。受可达效用保留理念启发,CCO实现了集体保守性:行动会面临与监督者关切程度成比例的惩罚,因此当监督者认为高效用行动无异议时仍会被采纳,仅在关切累积时被否决。CCO利用共形决策理论在线校准该保守性,确保不良结果在无需分布假设的情况下,以有限时间界保持低于用户指定的目标阈值。在改进版SWE-bench基准上,较弱监督者成功约束了对抗性错误对齐的较强智能体;在MACHIAVELLI基准中,CCO在保持奖励的同时显著降低了伦理违规。两种场景下的经验违规率均与理论预测的目标阈值高度吻合。