We present a theoretical framework showing that popular LLM alignment methods, including RLHF and its variants, can be understood as divergence estimators between aligned (safe or preferred) and unaligned (harmful or less preferred) distributions. This perspective explains the emergence of separation in the latent space between safe and harmful prompts after alignment. As an application of our general divergence framework, we propose KLDO, a novel KL divergence-based alignment method, and empirically validate its effectiveness. We further show that using compliance-refusal datasets, rather than standard preference-based datasets, leads to stronger separation and improved safety alignment. Finally, to quantify the separation effect, we propose a distance-based metric in the prompt representation space, which also acts as a statistically significant indicator for model safety.
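For intuition, here is a minimal sketch of what a distance-based separation metric over prompt representations could look like. The normalized centroid distance between safe and harmful prompt embeddings used below is a hypothetical stand-in for illustration only, not necessarily the exact metric proposed in the paper.

```python
import numpy as np

def separation_score(safe_reps: np.ndarray, harmful_reps: np.ndarray) -> float:
    """Illustrative separation score between two sets of prompt
    representations (rows are prompts, columns are hidden dimensions).

    Hypothetical variant: distance between group centroids, normalized by
    the average within-group spread. Larger values indicate stronger
    separation of safe vs. harmful prompts in the latent space.
    """
    mu_safe = safe_reps.mean(axis=0)
    mu_harm = harmful_reps.mean(axis=0)
    between = np.linalg.norm(mu_safe - mu_harm)
    within = 0.5 * (
        np.linalg.norm(safe_reps - mu_safe, axis=1).mean()
        + np.linalg.norm(harmful_reps - mu_harm, axis=1).mean()
    )
    return between / (within + 1e-8)

# Usage sketch: embed safe and harmful prompts with the model's hidden
# states (e.g., final-layer representations of the last token), then
# compare scores before and after alignment. Synthetic data shown here.
rng = np.random.default_rng(0)
safe = rng.normal(loc=0.0, scale=1.0, size=(128, 768))
harmful = rng.normal(loc=2.0, scale=1.0, size=(128, 768))
print(f"separation score: {separation_score(safe, harmful):.3f}")
```

In practice, one would compute this score on the aligned and unaligned model's representations of the same prompt sets; an increase after alignment would reflect the emergence of latent-space separation described above.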