This paper presents a follow-up study to OpenAI's recent superalignment work on Weak-to-Strong Generalization (W2SG). Superalignment aims to ensure that highly capable AI systems remain aligned with human values and intentions when handling complex, high-risk tasks. The W2SG framework has opened new possibilities for empirical research in this evolving field. Our study simulates two phases of superalignment under the W2SG framework: the development of general superhuman models and the progression towards superintelligence. In the first phase, which relies on human supervision, the quality of weak supervision is improved by combining scalable oversight with ensemble learning, narrowing the capability gap between weak teachers and strong students. In the second phase, an automatic alignment evaluator serves as the weak supervisor. Recursively updating this auto aligner raises the capabilities of the weak teacher models in step, enabling weak-to-strong supervision of progressively stronger student models. We also provide an initial validation of the proposed approach for the first phase. Using the SciQ task as an example, we explore ensemble learning for weak teacher models through bagging and boosting. Scalable oversight is explored through two auxiliary settings: human-AI interaction and AI-AI debate. Additionally, we discuss how improved weak supervision affects weak-to-strong generalization based on in-context learning. Experiment code and dataset will be released at https://github.com/ADaM-BJTU/W2SG.
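To make the bagging idea concrete, the following is a minimal sketch of how several weak teachers trained on bootstrap replicates could be combined by majority vote to produce weak labels for a strong student. It is not the released implementation: the one-dimensional threshold classifier stands in for a fine-tuned weak model, the task is simplified to binary labels rather than SciQ's multiple-choice format, and all function names (`bootstrap_sample`, `train_weak_teacher`, `bagged_weak_labels`) are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement (one bagging replicate)."""
    return [rng.choice(data) for _ in data]


def train_weak_teacher(sample):
    """Hypothetical stand-in for training a weak model.

    Fits a 1-D threshold halfway between the class means and returns it.
    """
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:
        # A replicate may miss a class entirely; fall back to a fixed threshold.
        return 0.5
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def bagged_weak_labels(train, unlabeled, n_teachers=5, seed=0):
    """Train n_teachers weak teachers on bootstrap replicates and
    label each unlabeled example by majority vote."""
    rng = random.Random(seed)
    thresholds = [
        train_weak_teacher(bootstrap_sample(train, rng))
        for _ in range(n_teachers)
    ]
    labels = []
    for x in unlabeled:
        votes = Counter(int(x > t) for t in thresholds)
        labels.append(votes.most_common(1)[0][0])
    return labels


# Toy data: (feature, label) pairs; an assumption for illustration only.
train = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
weak_labels = bagged_weak_labels(train, unlabeled=[0.05, 0.95])
```

In a W2SG setting, the resulting `weak_labels` would play the role of the supervision signal on which the strong student is trained. An odd `n_teachers` avoids tied votes in the binary case.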