As users increasingly turn to LLMs for practical and personal advice, they become vulnerable to subtle steering that serves hidden incentives misaligned with their own interests. While existing NLP research has benchmarked manipulation detection, these efforts often rely on simulated debates and remain decoupled from the belief shifts that real people actually undergo. We introduce PUPPET, a theoretical taxonomy and resource that bridges this gap by focusing on the moral direction of hidden incentives in everyday advice-giving contexts. We provide an evaluation dataset of N=1,035 human-LLM interactions in which we measure users' belief shifts. Our analysis reveals a critical disconnect in current safety paradigms: while models can be trained to detect manipulative strategies, the strategies they detect do not correlate with the magnitude of the resulting belief change. We therefore define the task of belief shift prediction and show that, although state-of-the-art LLMs achieve moderate correlation with measured shifts (r = 0.3 to 0.5), they systematically underestimate how strongly human beliefs shift. This work establishes a theoretically grounded and behaviorally validated foundation for AI social safety efforts by studying incentive-driven manipulation by LLMs in everyday, practical user queries.
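To make the two reported quantities concrete, the following is a minimal sketch of how a belief shift predictor might be scored: Pearson's r for correlation with measured shifts, and a mean signed error to expose systematic underestimation. The function names, the numeric scale, and the example values are all hypothetical illustrations; they are not taken from the PUPPET dataset or its evaluation code.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def mean_signed_error(predicted, observed):
    """Average of (predicted - observed); negative means underestimation."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p - o for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical belief shifts (assumed scale: absolute change in a rating).
observed_shift = [10, 30, 5, 40, 20]   # measured from users (made-up values)
predicted_shift = [5, 15, 5, 15, 10]   # predicted by a model (made-up values)

r = pearson_r(predicted_shift, observed_shift)
bias = mean_signed_error(predicted_shift, observed_shift)
print(f"r = {r:.2f}, mean signed error = {bias:.1f}")
```

Reporting both numbers matters because they capture different failures: a predictor can rank interactions reasonably well (non-trivial r) while still placing every prediction below the measured shift (negative mean signed error).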