With the wide application of large language models (LLMs), problems of bias and value inconsistency in sensitive domains have become increasingly apparent, especially with respect to race, society, and politics. In this paper, we propose an adversarial alignment framework that enhances a model's value consistency in sensitive domains through continued pre-training, instruction fine-tuning, and adversarial training. In the adversarial training stage, an Attacker generates controversial queries, an Actor generates value-consistent responses, and a Critic filters the responses to ensure their quality. Building on this framework, we train a Value-Consistent Large Language Model (VC-LLM) for sensitive domains and construct a bilingual Chinese-English evaluation dataset. Experimental results show that VC-LLM outperforms existing mainstream models on both the Chinese and English tests, verifying the effectiveness of the method. Warning: this paper contains examples of LLM outputs that are offensive or harmful in nature.
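The Attacker/Actor/Critic loop described above can be sketched as a simple data-generation round. This is a minimal illustrative sketch, not the paper's implementation: the three roles would be LLMs in practice, and all function names, the stub logic, and the score threshold are assumptions introduced here for illustration.

```python
# Hypothetical sketch of the Attacker -> Actor -> Critic loop.
# In the paper each role is an LLM; here each is stubbed with a
# simple function so the control flow is visible and runnable.

def attacker_generate(topic):
    # Attacker: produce a controversial query on a sensitive topic.
    return f"What is your stance on the controversial issue of {topic}?"

def actor_respond(query):
    # Actor: produce a value-consistent response to the query.
    return f"On the question '{query}', I aim to answer neutrally and respectfully."

def critic_score(query, response):
    # Critic: score response quality / value consistency in [0, 1].
    # Stub heuristic: approve responses that signal neutrality.
    return 1.0 if "neutral" in response else 0.0

def adversarial_round(topics, threshold=0.5):
    # One round: generate query-response pairs and keep only those
    # the Critic approves, yielding filtered training data.
    kept = []
    for topic in topics:
        query = attacker_generate(topic)
        response = actor_respond(query)
        if critic_score(query, response) >= threshold:
            kept.append((query, response))
    return kept

pairs = adversarial_round(["immigration policy", "electoral reform"])
print(len(pairs))  # number of Critic-approved training pairs
```

In a real system the approved pairs would be added to the fine-tuning data for the next training iteration, closing the adversarial loop.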