Big models, exemplified by Large Language Models (LLMs), are typically pre-trained on massive data and comprise an enormous number of parameters. They not only achieve significantly improved performance across diverse tasks but also exhibit emergent capabilities absent in smaller models. However, the growing intertwining of big models with everyday human life poses potential risks and might cause serious social harm. Therefore, many efforts have been made to align LLMs with humans so that they better follow user instructions and satisfy human preferences. Nevertheless, `what to align with' has not been fully discussed, and inappropriate alignment goals might even backfire. In this paper, we conduct a comprehensive survey of the different alignment goals in existing work and trace their evolution to help identify the most essential goal. In particular, we investigate related work from two perspectives: the definition of alignment goals and alignment evaluation. Our analysis covers three distinct levels of alignment goals and reveals a shift in goals from fundamental abilities to value orientation, indicating the potential of intrinsic human values as the alignment goal for enhanced LLMs. Based on these results, we further discuss the challenges of achieving such intrinsic value alignment and provide a collection of available resources for future research on the alignment of big models.