Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu

Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior, for example to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work. We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.

