Social alignment in AI systems aims to ensure that these models behave according to established societal values. However, unlike humans, who reach consensus on value judgments through social interaction, current language models (LMs) are trained in isolation to rigidly replicate their training corpus, which leads to poor generalization in unfamiliar scenarios and vulnerability to adversarial attacks. This work presents a novel training paradigm that allows LMs to learn from simulated social interactions. Compared with existing methodologies, our approach is considerably more scalable and efficient, and it demonstrates superior performance on alignment benchmarks and in human evaluations. This paradigm shift in LM training brings us a step closer to developing AI systems that can robustly and accurately reflect societal norms and values.