Large language models (LLMs) have demonstrated persuasive capabilities comparable to those of humans, offering promising benefits while raising societal concerns. However, systematically evaluating the persuasive capabilities of LLMs is inherently challenging, because the effectiveness of persuasion among humans varies significantly across domains. In this paper, we take a theory-driven approach to provide a scalable and principled framework for studying the persuasive capabilities of LLMs. Grounded in Bayesian persuasion theory, we repurpose human-human persuasion datasets to construct environments for evaluating and training LLMs as strategic persuaders. Our results reveal that frontier models consistently achieve high persuasion gains and exhibit sophisticated persuasion strategies that align with theoretical characterizations. Building on this, we use reinforcement learning to train LLMs for strategic persuasion in our environments, and find that even small LLMs can obtain significantly higher persuasion gains after training.
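As background for the theory the abstract invokes, the following is a minimal sketch of the canonical Bayesian persuasion model of Kamenica and Gentzkow (2011); the notation here is illustrative, and the paper's exact formalization may differ. A state \(\omega \in \Omega\) is drawn from a common prior \(\mu_0 \in \Delta(\Omega)\). The sender commits to a signaling scheme \(\pi : \Omega \to \Delta(S)\); after observing signal \(s\), the receiver forms the posterior by Bayes' rule and best-responds:
\[
\mu_s(\omega) = \frac{\pi(s \mid \omega)\,\mu_0(\omega)}{\sum_{\omega'} \pi(s \mid \omega')\,\mu_0(\omega')},
\qquad
a^*(\mu_s) = \arg\max_{a \in A} \; \mathbb{E}_{\omega \sim \mu_s}\!\left[u_R(a, \omega)\right].
\]
The sender chooses \(\pi\) to maximize \(\mathbb{E}\!\left[u_S(a^*(\mu_s), \omega)\right]\); the sender's optimal value equals the concavification of \(v(\mu) = \mathbb{E}_{\omega \sim \mu}\!\left[u_S(a^*(\mu), \omega)\right]\) evaluated at the prior \(\mu_0\).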
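To make the model concrete, the sketch below numerically checks the classic prosecutor-judge example from Kamenica and Gentzkow (2011): the judge convicts only if the posterior probability of guilt is at least 1/2, and with prior \(P(\text{guilty}) = 0.3\), the optimal signaling scheme raises the conviction probability from 30% (under full disclosure) to 60%. All names and numbers come from that textbook example, not from the paper under discussion.

```python
# Numeric check of the classic prosecutor-judge Bayesian persuasion example
# (Kamenica & Gentzkow, 2011). Illustrative only; not the paper's setup.

prior_guilty = 0.3   # common prior P(omega = guilty)
threshold = 0.5      # judge convicts iff posterior P(guilty | s) >= 1/2

# Optimal scheme: always send signal "g" (recommend conviction) when guilty,
# and send "g" with probability q when innocent. q is chosen so the posterior
# after "g" lands exactly on the judge's conviction threshold.
q = prior_guilty * (1 - threshold) / ((1 - prior_guilty) * threshold)  # = 3/7

p_signal_g = prior_guilty + (1 - prior_guilty) * q  # P(s = "g") = 0.6
posterior_g = prior_guilty / p_signal_g             # P(guilty | s = "g") = 0.5

assert posterior_g >= threshold  # the judge indeed convicts on signal "g"
print(f"P(convict) under full disclosure: {prior_guilty:.2f}")  # 0.30
print(f"P(convict) under optimal scheme:  {p_signal_g:.2f}")    # 0.60
```

The doubling of the conviction probability in this example is the kind of improvement over a non-strategic baseline that the abstract's "persuasion gains" gestures at, though the paper's own metric may be defined differently.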