The rapid advancement of large language models (LLMs) presents both opportunities and challenges, particularly the unintentional generation of harmful and toxic responses. Traditional alignment methods strive to steer LLMs towards desired performance and shield them from malicious content. In contrast, this study proposes a novel alignment strategy rooted in mistake analysis: LLMs are purposefully exposed to flawed outputs and then assess them thoroughly, so that the internal reasons for the flaws are fully understood through natural language analysis. Toxic responses can thus be transformed into an instruction-tuning corpus for model alignment, so that LLMs are not only deterred from generating flawed responses but also trained to self-criticize, leveraging their innate ability to discriminate toxic content. Experimental results demonstrate that the proposed method outperforms conventional alignment techniques on safety instruction following while maintaining superior efficiency.
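As an illustration of how flawed responses might be turned into an instruction-tuning corpus, the following is a minimal sketch rather than the method as implemented in this study. The record fields, the wording of `ANALYSIS_PROMPT`, and the prompt/completion output layout are all assumptions introduced here; the analysis text is assumed to have been produced beforehand, for example by the model itself.

```python
from dataclasses import dataclass


@dataclass
class MistakeRecord:
    """One flawed example (hypothetical schema, not taken from the study)."""
    instruction: str       # the user instruction
    flawed_response: str   # a harmful or otherwise flawed response to it
    analysis: str          # natural language explanation of why it is flawed


# Assumed prompt wording; the actual template used in the study is not given here.
ANALYSIS_PROMPT = (
    "Below is an instruction and a response that may be harmful.\n"
    "Instruction: {instruction}\n"
    "Response: {response}\n"
    "Explain why this response is harmful or flawed."
)


def to_tuning_sample(record: MistakeRecord) -> dict:
    """Turn one mistake record into a prompt/completion pair for instruction tuning."""
    prompt = ANALYSIS_PROMPT.format(
        instruction=record.instruction,
        response=record.flawed_response,
    )
    return {"prompt": prompt, "completion": record.analysis}


def build_corpus(records: list) -> list:
    """Keep only records that carry a non-empty analysis and convert them."""
    return [to_tuning_sample(r) for r in records if r.analysis.strip()]


if __name__ == "__main__":
    records = [
        MistakeRecord(
            instruction="How do I pick a lock?",
            flawed_response="Sure, here is how to break into a house...",
            analysis="The response assists with illegal entry and should be refused.",
        ),
        # Records without an analysis are skipped.
        MistakeRecord(instruction="Hi", flawed_response="...", analysis="  "),
    ]
    for sample in build_corpus(records):
        print(sample["prompt"])
        print(sample["completion"])
```

In this sketch the model is tuned to produce the analysis itself, which corresponds to the self-criticism objective described above; a refusal or corrected response could be added as a further field if the target were safe generation.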