Current AI alignment methodologies rely on human-provided demonstrations or judgments, so the learned capabilities of AI systems are upper-bounded by human capabilities. This raises a challenging research question: how can we keep improving these systems once their capabilities have surpassed those of humans? This paper addresses the question in the context of hard reasoning tasks (e.g., level 4-5 MATH problems) by learning from human annotations on easier tasks (e.g., level 1-3 MATH problems), a setting we term \textit{easy-to-hard generalization}. Our key insight is that an evaluator (reward model) trained on supervision for easier tasks can be used effectively to score candidate solutions to harder tasks, and hence to facilitate easy-to-hard generalization across task difficulty levels. Based on this insight, we propose a novel approach to scalable alignment that first trains process-supervised reward models on easy problems (e.g., level 1-3) and then uses them to evaluate the performance of policy models on hard problems. We show that such \textit{easy-to-hard generalization from evaluators} can enable \textit{easy-to-hard generalization in generators}, through either re-ranking or reinforcement learning (RL). Notably, our process-supervised 7B RL model achieves an accuracy of 34.0\% on MATH500, despite using human supervision only on easy problems. Our approach suggests a promising path toward AI systems that advance beyond the frontier of human supervision.
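The re-ranking route mentioned above can be illustrated with a minimal sketch. It assumes a step-level scorer that stands in for a process-supervised reward model trained on easy problems. The names `solution_score`, `rerank_best_of_n`, and `toy_step_scorer`, as well as the `min` and `prod` aggregation rules, are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: (problem, steps up to and including the current one)
# -> estimated probability that the current step is correct.
StepScorer = Callable[[str, Sequence[str]], float]


def solution_score(problem: str, steps: Sequence[str],
                   step_scorer: StepScorer, aggregate: str = "min") -> float:
    """Aggregate per-step scores into a single score for one candidate solution."""
    scores = [step_scorer(problem, steps[: i + 1]) for i in range(len(steps))]
    if not scores:
        return 0.0
    if aggregate == "min":
        # A solution is only as good as its weakest step.
        return min(scores)
    if aggregate == "prod":
        # Treat step scores as independent correctness probabilities.
        return math.prod(scores)
    raise ValueError(f"unknown aggregate: {aggregate}")


def rerank_best_of_n(problem: str, candidates: Sequence[List[str]],
                     step_scorer: StepScorer, aggregate: str = "min") -> List[str]:
    """Return the candidate solution with the highest aggregated score."""
    return max(candidates,
               key=lambda steps: solution_score(problem, steps, step_scorer, aggregate))


def toy_step_scorer(problem: str, prefix: Sequence[str]) -> float:
    """Toy stand-in for a trained reward model: distrusts steps containing 'guess'."""
    return 0.1 if "guess" in prefix[-1] else 0.9


if __name__ == "__main__":
    candidates = [
        ["x = 2 + 2", "guess x = 5"],
        ["x = 2 + 2", "x = 4"],
    ]
    best = rerank_best_of_n("What is 2 + 2?", candidates, toy_step_scorer)
    print(best)
```

In this sketch the generator is left unchanged and only the selection among its sampled candidates depends on the evaluator. In the RL route, the same kind of score would instead serve as a training signal for the policy.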