The emergence of pre-trained AI systems with powerful capabilities across a diverse and ever-increasing set of complex domains has raised a critical challenge for AI safety, as tasks can become too complicated for humans to judge directly. Irving et al. [2018] proposed a debate method to address this challenge, with the goal of pitting such AI models against each other until the problem of identifying (mis)alignment is broken down into a manageable subtask. While the promise of this approach is clear, the original framework assumed that the honest strategy can simulate deterministic AI systems for an exponential number of steps, which limits its applicability. In this paper, we address these limitations by designing a new set of debate protocols in which the honest strategy can always succeed using a simulation of only a polynomial number of steps, while still being able to verify the alignment of stochastic AI systems, even when the dishonest strategy is allowed to use exponentially many simulation steps.