While alignment algorithms are now commonly used to tune pre-trained language models towards a user's preferences, we lack explanations for the underlying mechanisms by which models become ``aligned'', which makes it difficult to explain phenomena like jailbreaks. In this work we study a popular algorithm, direct preference optimization (DPO), and the mechanisms by which it reduces toxicity. Specifically, we first study how toxicity is represented and elicited in a pre-trained language model, GPT2-medium. We then apply DPO with a carefully crafted pairwise dataset to reduce toxicity. We examine how the resulting model averts toxic outputs, and find that capabilities learned from pre-training are not removed, but rather bypassed. We use this insight to demonstrate a simple method to un-align the model, reverting it to its toxic behavior.