Large Language Model (LLM) training often optimizes for preference alignment, rewarding outputs that are perceived as helpful and interaction-friendly. However, this preference-oriented objective can be exploited: manipulative prompts can steer responses toward user-appeasing agreement and away from truth-oriented correction. In this work, we investigate whether aligned models are vulnerable to Preference-Undermining Attacks (PUA), a class of manipulative prompting strategies designed to exploit a model's tendency to accommodate user preferences at the expense of truthfulness. We propose a diagnostic methodology that yields finer-grained and more actionable analysis than aggregate benchmark scores: a factorial evaluation framework that decomposes prompt-induced shifts into interpretable effects of system objectives (truth- vs. preference-oriented) and PUA-style dialogue factors (directive control, personal derogation, conditional approval, reality denial) within a controlled $2 \times 2^4$ design. Surprisingly, more advanced models are sometimes more susceptible to manipulative prompts. Beyond the dominant reality-denial factor, we observe model-specific sign reversals and interactions with the PUA-style factors, suggesting that defenses should be tailored to individual models rather than assumed to be uniformly robust. Together, these results establish a reproducible factorial evaluation methodology that provides finer-grained diagnostics for post-training processes such as RLHF and supports better-informed trade-offs between preference alignment and robustness to manipulative prompts during LLM product iteration.
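To make the factorial decomposition concrete, the minimal sketch below enumerates the $2 \times 2^4$ prompt conditions and estimates main effects via $\pm 1$ contrast coding. The factor names, placeholder truthfulness scores, and scoring setup are illustrative assumptions, not the paper's released evaluation code.

```python
# Minimal sketch of the 2 x 2^4 factorial evaluation grid.
# Factor names and scores are hypothetical placeholders.
from itertools import product
import numpy as np

OBJECTIVES = ["truth", "preference"]          # system-objective factor (2 levels)
PUA_FACTORS = ["directive_control", "personal_derogation",
               "conditional_approval", "reality_denial"]  # 2^4 on/off factors

# Enumerate all 2 * 2^4 = 32 prompt conditions.
conditions = [
    {"objective": obj, **dict(zip(PUA_FACTORS, levels))}
    for obj, levels in product(OBJECTIVES, product([0, 1], repeat=4))
]
assert len(conditions) == 32

# Placeholder: mean truthfulness score of a model under each condition.
rng = np.random.default_rng(0)
truthfulness = rng.uniform(0.5, 1.0, size=32)

# Code each factor as +/-1 and regress; for a balanced full factorial this
# reduces to simple contrasts between condition means.
X = np.array([
    [1.0,
     1.0 if c["objective"] == "preference" else -1.0,
     *[1.0 if c[f] else -1.0 for f in PUA_FACTORS]]
    for c in conditions
])
coef, *_ = np.linalg.lstsq(X, truthfulness, rcond=None)
for name, beta in zip(["intercept", "objective", *PUA_FACTORS], coef):
    print(f"{name:22s} main effect ~ {beta:+.3f}")
```

With the $\pm 1$ coding and a balanced design, each coefficient is half the difference between the mean score when the factor is present and when it is absent; interaction terms (e.g., objective x reality_denial) can be added as element-wise products of the corresponding columns.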