Recent generative AI systems have demonstrated more advanced persuasive capabilities and are increasingly permeating areas of life where they can influence decision-making. Generative AI presents a new risk profile of persuasion due to the opportunity for reciprocal exchange and prolonged interactions. This has led to growing concerns about harms from AI persuasion and how they can be mitigated, highlighting the need for a systematic study of AI persuasion. Current definitions of AI persuasion are unclear, and related harms are insufficiently studied. Existing harm mitigation approaches prioritise harms from the outcome of persuasion over harms from the process of persuasion. In this paper, we lay the groundwork for the systematic study of AI persuasion. We first put forward definitions of persuasive generative AI. We distinguish between rationally persuasive generative AI, which relies on providing relevant facts, sound reasoning, or other forms of trustworthy evidence, and manipulative generative AI, which relies on exploiting cognitive biases and heuristics or misrepresenting information. We also put forward a map of harms from AI persuasion, including definitions and examples of economic, physical, environmental, psychological, sociocultural, political, privacy, and autonomy harms. We then introduce a map of mechanisms that contribute to harmful persuasion. Lastly, we provide an overview of approaches that can be used to mitigate process harms of persuasion, including prompt engineering for manipulation classification and red teaming. Future work will operationalise these mitigations and study the interaction between different types of persuasion mechanisms.