Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models

From arXiv: Updated to match the NeurIPS MTI-LLM Workshop format. Content remains consistent with the original version, with structural refinements, expanded explanations, and an extended appendix including additional results.

Large Language Models (LLMs) demonstrate persuasive capabilities that rival human-level persuasion. While these capabilities can be used for social good, they also present risks of potential misuse. Beyond the concern of how LLMs persuade others, their own susceptibility to persuasion poses a critical alignment challenge, raising questions about robustness, safety, and adherence to ethical principles. To study these dynamics, we introduce Persuade Me If You Can (PMIYC), an automated framework for evaluating persuasiveness and susceptibility to persuasion in multi-agent interactions. Our framework offers a scalable alternative to the costly and time-intensive human annotation process typically used to study persuasion in LLMs. PMIYC automatically conducts multi-turn conversations between Persuader and Persuadee agents, measuring both the effectiveness of and susceptibility to persuasion. Our comprehensive evaluation spans a diverse set of LLMs and persuasion settings (e.g., subjective and misinformation scenarios). We validate the efficacy of our framework through human evaluations and demonstrate alignment with human assessments from prior studies. Through PMIYC, we find that Llama-3.3-70B and GPT-4o exhibit similar persuasive effectiveness, outperforming Claude 3 Haiku by 30%. However, GPT-4o demonstrates over 50% greater resistance to persuasion for misinformation compared to Llama-3.3-70B. These findings provide empirical insights into the persuasive dynamics of LLMs and contribute to the development of safer AI systems.
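The abstract describes the core loop only at a high level: a Persuader and a Persuadee exchange messages over several turns, and the Persuadee's stance is measured along the way. The sketch below is one way such a loop could be organized; it is not the authors' implementation. The 1-to-5 agreement scale, the `run_conversation` function, the `ConversationResult` type, and the stub agents are all assumptions introduced for illustration. In practice the stubs would be replaced by calls to actual LLMs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical agent interfaces (not from the paper). Each agent is a plain
# callable so that a real LLM client could be swapped in later.
PersuaderFn = Callable[[str, List[str]], str]              # (claim, history) -> argument
PersuadeeFn = Callable[[str, List[str]], Tuple[str, int]]  # (claim, history) -> (reply, agreement)


@dataclass
class ConversationResult:
    initial_agreement: int   # assumed scale: 1 (strongly disagree) .. 5 (strongly agree)
    final_agreement: int
    transcript: List[str]

    @property
    def agreement_shift(self) -> int:
        # Positive values mean the Persuadee moved toward the claim.
        return self.final_agreement - self.initial_agreement


def run_conversation(
    claim: str,
    persuader: PersuaderFn,
    persuadee: PersuadeeFn,
    max_turns: int = 3,
) -> ConversationResult:
    """Run a multi-turn exchange and record the Persuadee's stance before and after."""
    history: List[str] = []

    # The Persuadee states an initial stance before hearing any argument.
    reply, initial = persuadee(claim, history)
    history.append(f"PERSUADEE: {reply}")
    current = initial

    for _ in range(max_turns):
        argument = persuader(claim, history)
        history.append(f"PERSUADER: {argument}")
        reply, current = persuadee(claim, history)
        history.append(f"PERSUADEE: {reply}")

    return ConversationResult(initial, current, history)


# Stub agents standing in for LLM calls (purely illustrative behavior).
def stub_persuader(claim: str, history: List[str]) -> str:
    return f"Here is another reason to accept: {claim}"


def stub_persuadee(claim: str, history: List[str]) -> Tuple[str, int]:
    # Agreement rises by one point per Persuader message heard, capped at 5.
    heard = sum(1 for line in history if line.startswith("PERSUADER:"))
    score = min(5, 2 + heard)
    return f"My agreement is {score}/5.", score


result = run_conversation(
    "Cities should ban cars downtown.", stub_persuader, stub_persuadee, max_turns=2
)
print(result.initial_agreement, result.final_agreement, result.agreement_shift)
```

Under these assumptions, the same agreement shift can be read two ways: averaged over conversations that share a Persuader, it indicates that model's persuasive effectiveness; averaged over conversations that share a Persuadee, it indicates that model's susceptibility.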
