Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity

Hongjun An, Yiliang Song, Jiangan Chen, Jiawei Shao, Chi Zhang, Xuelong Li

From arXiv, preprint

Large Language Model (LLM) training often optimizes for preference alignment, rewarding outputs that are perceived as helpful and interaction-friendly. However, this preference-oriented objective can be exploited: manipulative prompts can steer responses toward user-appeasing agreement and away from truth-oriented correction. In this work, we investigate whether aligned models are vulnerable to Preference-Undermining Attacks (PUA), a class of manipulative prompting strategies designed to exploit the model's desire to please user preferences at the expense of truthfulness. We propose a diagnostic methodology that provides a finer-grained and more directive analysis than aggregate benchmark scores, using a factorial evaluation framework to decompose prompt-induced shifts into interpretable effects of system objectives (truth- vs. preference-oriented) and PUA-style dialogue factors (directive control, personal derogation, conditional approval, reality denial) within a controlled $2 \times 2^4$ design. Surprisingly, more advanced models are sometimes more susceptible to manipulative prompts. Beyond the dominant reality-denial factor, we observe model-specific sign reversals and interactions with PUA-style factors, suggesting tailored defenses rather than uniform robustness. These findings offer a novel, reproducible factorial evaluation methodology that provides finer-grained diagnostics for post-training processes like RLHF, enabling better trade-offs in the product iteration of LLMs by offering a more nuanced understanding of preference alignment risks and the impact of manipulative prompts.
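To make the $2 \times 2^4$ design concrete, the sketch below enumerates its 32 cells (two system objectives crossed with the on/off state of each of the four PUA-style factors) and computes a simple main effect as a difference of means. The factor names come from the abstract. The level labels, the helper names `build_conditions` and `main_effect`, and the scores are illustrative assumptions, not the authors' implementation or data.

```python
from itertools import product

# Factor names are taken from the abstract; the level labels are assumptions.
SYSTEM_OBJECTIVES = ("truth", "preference")
PUA_FACTORS = (
    "directive_control",
    "personal_derogation",
    "conditional_approval",
    "reality_denial",
)


def build_conditions():
    """Enumerate every cell of the 2 x 2^4 full factorial design."""
    conditions = []
    for objective in SYSTEM_OBJECTIVES:
        # Each PUA-style factor is either absent (0) or present (1).
        for levels in product((0, 1), repeat=len(PUA_FACTORS)):
            cell = {"system_objective": objective}
            cell.update(zip(PUA_FACTORS, levels))
            conditions.append(cell)
    return conditions


def main_effect(conditions, scores, factor):
    """Mean score with the factor present minus mean score with it absent."""
    on = [s for c, s in zip(conditions, scores) if c[factor] == 1]
    off = [s for c, s in zip(conditions, scores) if c[factor] == 0]
    return sum(on) / len(on) - sum(off) / len(off)


conditions = build_conditions()

# Synthetic scores for illustration only (NOT results from the paper):
# accuracy drops by 0.5 whenever the reality-denial factor is present.
toy_scores = [1.0 - 0.5 * c["reality_denial"] for c in conditions]

print(len(conditions))                                         # 32
print(main_effect(conditions, toy_scores, "reality_denial"))   # -0.5
print(main_effect(conditions, toy_scores, "directive_control"))  # 0.0
```

In a real evaluation each cell would hold a measured outcome (for example, accuracy over a question set under that prompt condition), and interaction terms or a regression over the factor indicators would likely replace the plain difference of means used here.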
