Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior, for example to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in ways too complex for humans to evaluate reliably; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work. We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
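The abstract names an auxiliary confidence loss but does not give its form. As a rough illustration only, the sketch below assumes a binary classification task and one plausible formulation: a weighted mix of cross-entropy against the weak supervisor's label and cross-entropy against the strong model's own thresholded ("hardened") prediction. The function names, the `alpha` weight, and the fixed `threshold` are hypothetical choices for this sketch, not details taken from the text.

```python
import math


def binary_cross_entropy(p_pred, p_target, eps=1e-12):
    """Cross-entropy of a predicted probability against a target probability."""
    # Clip to avoid log(0) for saturated predictions.
    p_pred = min(max(p_pred, eps), 1.0 - eps)
    return -(p_target * math.log(p_pred)
             + (1.0 - p_target) * math.log(1.0 - p_pred))


def confidence_aux_loss(strong_probs, weak_probs, alpha=0.5, threshold=0.5):
    """Mean over examples of
        (1 - alpha) * CE(strong, weak label) + alpha * CE(strong, hardened strong).

    ASSUMPTION: this mixing form, `alpha`, and `threshold` are illustrative.
    In a real training loop the hardened target would be treated as a
    constant (no gradient flows through it).
    """
    if len(strong_probs) != len(weak_probs):
        raise ValueError("strong_probs and weak_probs must have equal length")
    if not strong_probs:
        raise ValueError("need at least one example")

    total = 0.0
    for p_strong, p_weak in zip(strong_probs, weak_probs):
        # Harden the strong model's own prediction into a 0/1 pseudo-label.
        hardened = 1.0 if p_strong > threshold else 0.0
        weak_term = binary_cross_entropy(p_strong, p_weak)
        self_term = binary_cross_entropy(p_strong, hardened)
        total += (1.0 - alpha) * weak_term + alpha * self_term
    return total / len(strong_probs)


if __name__ == "__main__":
    # The strong model is confident (0.9) but the weak label says 0.
    naive = confidence_aux_loss([0.9], [0.0], alpha=0.0)
    mixed = confidence_aux_loss([0.9], [0.0], alpha=0.5)
    print(f"naive loss: {naive:.3f}, with confidence term: {mixed:.3f}")
```

With `alpha=0` the function reduces to naive finetuning on weak labels. A positive `alpha` lowers the penalty when the strong model confidently disagrees with its supervisor, which is the intuition behind letting the student override weak-label errors.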