Alignment of large language models (LLMs) involves training models on preference-contrastive output pairs to adjust their responses according to human preferences. To obtain such contrastive pairs, traditional methods like RLHF and RLAIF rely on limited contrasting patterns, such as varying model variants or decoding temperatures. This lack of diversity leads to two issues: (1) alignment is not comprehensive, and consequently (2) models remain susceptible to jailbreaking attacks. To address these issues, we investigate how to construct more comprehensive and diversified contrasting patterns to enhance preference data (RQ1) and verify the impact of diversifying contrasting patterns on model alignment (RQ2). For RQ1, we propose PopAlign, a framework that integrates diversified contrasting patterns across the prompt, model, and pipeline levels, introducing six contrasting strategies that require no additional feedback-labeling procedures. For RQ2, we conduct thorough experiments demonstrating that PopAlign significantly outperforms existing methods, yielding more comprehensive alignment.