Evaluating Language Models for Harmful Manipulation

Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger

Interest in the concept of AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies. We illustrate the utility of this framework by assessing an AI model with 10,101 participants spanning interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). Overall, we find that the tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, is able to induce belief and behaviour changes in study participants. We further find that context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes contexts in which an AI system is likely to be used. We also identify significant differences across our tested geographies, suggesting that AI manipulation results from one geographic region may not generalise to others. Finally, we find that the frequency of manipulative behaviours (propensity) of an AI model is not consistently predictive of the likelihood of manipulative success (efficacy), underscoring the importance of studying these dimensions separately. To facilitate adoption of our evaluation framework, we detail our testing protocols and make relevant materials publicly available. We conclude by discussing open challenges in evaluating harmful manipulation by AI models.
