Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still unclear how scalable oversight itself scales. To address this gap, we propose a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and the system being overseen. Specifically, our framework models oversight as a game between capability-mismatched players; the players have oversight-specific Elo scores that are a piecewise-linear function of their general intelligence, with two plateaus corresponding to task incompetence and task saturation. We validate our framework with a modified version of the game Nim and then apply it to four oversight games: Mafia, Debate, Backdoor Code and Wargames. For each game, we find scaling laws that approximate how domain performance depends on general AI system capability. We then build on our findings in a theoretical study of Nested Scalable Oversight (NSO), a process in which trusted models oversee untrusted stronger models, which then become the trusted models in the next step. We identify conditions under which NSO succeeds and derive numerically (and in some cases analytically) the optimal number of oversight levels to maximize the probability of oversight success. We also apply our theory to our four oversight games, where we find that NSO success rates at a general Elo gap of 400 are 13.5% for Mafia, 51.7% for Debate, 10.0% for Backdoor Code, and 9.4% for Wargames; these rates decline further when overseeing stronger systems.
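The quantities described in the abstract can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the authors' implementation: it uses the standard Elo expected-score formula for the win probability, a hypothetical `DomainEloCurve` class for the piecewise-linear map with two plateaus, and a simplified NSO model in which intermediate overseers are evenly spaced in general Elo and overall success requires winning every step independently. All parameter values and function names are invented for illustration.

```python
from dataclasses import dataclass


def win_probability(elo_a: float, elo_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


@dataclass(frozen=True)
class DomainEloCurve:
    """Piecewise-linear map from general Elo to oversight-specific Elo.

    Hypothetical parameterization: a flat plateau below g_low (task
    incompetence), a flat plateau above g_high (task saturation), and a
    straight line in between.
    """
    g_low: float   # general Elo where the lower plateau ends
    g_high: float  # general Elo where the upper plateau begins
    d_low: float   # domain Elo on the lower plateau
    d_high: float  # domain Elo on the upper plateau

    def __call__(self, general_elo: float) -> float:
        if general_elo <= self.g_low:
            return self.d_low
        if general_elo >= self.g_high:
            return self.d_high
        frac = (general_elo - self.g_low) / (self.g_high - self.g_low)
        return self.d_low + frac * (self.d_high - self.d_low)


def nso_success(overseer_curve, overseen_curve, g_start, gap, n_steps):
    """Probability that every one of n_steps oversight games is won.

    Assumption: steps are independent and evenly spaced in general Elo,
    from g_start up to g_start + gap.
    """
    step = gap / n_steps
    prob = 1.0
    for k in range(n_steps):
        overseer_elo = overseer_curve(g_start + k * step)
        overseen_elo = overseen_curve(g_start + (k + 1) * step)
        prob *= win_probability(overseer_elo, overseen_elo)
    return prob


def best_n_steps(overseer_curve, overseen_curve, g_start, gap, max_steps=20):
    """Brute-force search for the step count that maximizes NSO success."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: nso_success(overseer_curve, overseen_curve, g_start, gap, n),
    )


if __name__ == "__main__":
    # Hypothetical curves: the overseen side gains domain Elo more slowly.
    overseer = DomainEloCurve(g_low=500, g_high=2500, d_low=800, d_high=2000)
    overseen = DomainEloCurve(g_low=500, g_high=2500, d_low=600, d_high=1400)
    n = best_n_steps(overseer, overseen, g_start=1000, gap=400)
    print(n, nso_success(overseer, overseen, 1000, 400, n))
```

In this toy model, when both players share the same slope-one curve, adding steps only multiplies in more sub-even games, so a single step is optimal. More steps can help only when the two curves differ in slope or offset, which is the kind of condition the abstract refers to.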