Alignment is the most critical step in building large language models (LLMs) that meet human needs. As LLMs rapidly develop and gradually surpass human capabilities, traditional alignment methods based on human annotation are increasingly unable to meet scalability demands. There is therefore an urgent need to explore new sources of automated alignment signals and new technical approaches. In this paper, we systematically review recently emerging methods of automated alignment and explore how to achieve effective, scalable, automated alignment once the capabilities of LLMs exceed those of humans. Specifically, we group existing automated alignment methods into four major categories according to the source of their alignment signals, and we discuss the current status and potential development of each category. We also examine the underlying mechanisms that enable automated alignment and, starting from the fundamental role of alignment, discuss the essential factors that make automated alignment technologies feasible and effective.