As AI capabilities increasingly surpass human proficiency in complex tasks, current alignment techniques, including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), face fundamental challenges in ensuring reliable oversight. These methods rely on direct human assessment and become impractical when AI outputs exceed human cognitive capacity. In response to this challenge, we explore two hypotheses: (1) \textit{Critique of critique can be easier than critique itself}, extending the widely accepted observation that verification is easier than generation to the critique domain, since critique is itself a specialized form of generation; (2) \textit{This difficulty relationship holds recursively}, suggesting that when direct evaluation is infeasible, performing higher-order critiques (e.g., critique of critique of critique) offers a more tractable supervision pathway. We conduct Human-Human, Human-AI, and AI-AI experiments to investigate the potential of recursive self-critiquing for AI supervision. Our results highlight recursive critique as a promising approach to scalable AI oversight.
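The recursive self-critiquing protocol described above can be sketched as a simple loop: level 0 is the model's direct response, and each level $k$ is a critique of the level $k-1$ output. The snippet below is a minimal illustration, not the paper's implementation; the `model` callable, the prompt templates, and the `toy_model` stand-in are all hypothetical placeholders.

```python
# Minimal sketch of recursive self-critiquing (hypothetical interfaces).
from typing import Callable, List

def recursive_critique(model: Callable[[str], str],
                       question: str,
                       depth: int) -> List[str]:
    """Return [response, critique, critique-of-critique, ...] up to `depth`."""
    outputs = [model(question)]  # level 0: the direct answer
    for level in range(1, depth + 1):
        # Level k prompt asks for a critique of the level k-1 output.
        prompt = (f"Question: {question}\n"
                  f"Level-{level - 1} output: {outputs[-1]}\n"
                  f"Write a critique of the output above.")
        outputs.append(model(prompt))
    return outputs

# Toy stand-in model so the sketch runs without any API access.
def toy_model(prompt: str) -> str:
    return f"[generated for: {prompt[:30]}...]"

chain = recursive_critique(toy_model, "Is 17 prime?", depth=2)
# chain holds the answer, a critique, and a critique of that critique.
```

Under the paper's second hypothesis, a supervisor who cannot judge `chain[0]` directly may still be able to judge `chain[2]`, the highest-order critique.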