Large language models (LLMs) have demonstrated persuasive capabilities comparable to those of humans, offering potential benefits while raising societal concerns about their deployment. However, systematically evaluating the persuasive capabilities of LLMs is inherently challenging, as the effectiveness of persuasion among humans varies significantly across domains. In this paper, we take a theory-driven approach to provide a scalable and principled framework for measuring the persuasive capabilities of LLMs. Grounded in the Bayesian Persuasion (BP) framework, we repurpose existing human-human persuasion datasets to construct environments for evaluating and training LLMs in strategic persuasion. Our results reveal that frontier models can consistently achieve high persuasion gains and exhibit sophisticated persuasion strategies that align with theoretical predictions. Building on this, we use reinforcement learning to train LLMs for strategic persuasion in our environments, and find that even small LLMs can obtain significantly higher persuasion gains through such training.