Big models, exemplified by Large Language Models (LLMs), are models typically pre-trained on massive data and comprising an enormous number of parameters. They not only achieve significantly improved performance across diverse tasks but also exhibit emergent capabilities absent in smaller models. However, the growing intertwining of big models with everyday human life poses potential risks and might cause serious social harm. Therefore, many efforts have been made to align LLMs with humans so that they better follow user instructions and satisfy human preferences. Nevertheless, the question of "what to align with" has not been fully discussed, and inappropriate alignment goals might even backfire. In this paper, we conduct a comprehensive survey of the different alignment goals in existing work and trace their evolution to help identify the most essential goal. In particular, we investigate related work from two perspectives: the definition of alignment goals and alignment evaluation. Our analysis encompasses three distinct levels of alignment goals and reveals a shift in goals from fundamental abilities to value orientation, indicating the potential of intrinsic human values as the alignment goal for enhanced LLMs. Based on these results, we further discuss the challenges of achieving such intrinsic value alignment and provide a collection of available resources for future research on the alignment of big models.