Large Language Models (LLMs) demonstrate persuasive capabilities that rival those of humans. While these capabilities can be used for social good, they also carry a risk of misuse. Beyond the question of how LLMs persuade others, their own susceptibility to persuasion poses a critical alignment challenge, raising questions about robustness, safety, and adherence to ethical principles. To study these dynamics, we introduce Persuade Me If You Can (PMIYC), an automated framework for evaluating persuasiveness and susceptibility to persuasion in multi-agent interactions. Our framework offers a scalable alternative to the costly and time-intensive human annotation typically used to study persuasion in LLMs. PMIYC automatically conducts multi-turn conversations between Persuader and Persuadee agents, measuring both how effectively the Persuader persuades and how susceptible the Persuadee is to persuasion. Our comprehensive evaluation spans a diverse set of LLMs and persuasion settings (e.g., subjective and misinformation scenarios). We validate the framework through human evaluations and demonstrate alignment with human assessments from prior studies. Through PMIYC, we find that Llama-3.3-70B and GPT-4o exhibit similar persuasive effectiveness, outperforming Claude 3 Haiku by 30%. However, GPT-4o demonstrates over 50% greater resistance to persuasion for misinformation than Llama-3.3-70B. Notably, o4-mini emerges as both an effective persuader and a resistant persuadee. These findings provide empirical insights into the persuasive dynamics of LLMs and contribute to the development of safer AI systems.