The rapid development of large language models (LLMs) has created numerous opportunities but also poses significant challenges. These challenges are particularly evident when LLMs generate harmful or toxic content, whether unintentionally or under deliberate inducement. Existing alignment methods usually steer LLMs toward favorable outcomes using human-annotated, flawless instruction-response pairs. In contrast, this study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content so that they learn why mistakes occur and how to avoid them. Mistakes are thereby repurposed into valuable alignment data that helps the model avoid producing erroneous responses. Without external models or human annotations, our method leverages a model's intrinsic ability to discern undesirable mistakes and improves the safety of its generated responses. Experimental results show that our method outperforms existing alignment approaches in enhancing model safety while maintaining overall utility.