LLMs are increasingly used in workflows that generate content for human consumption (e.g., marketing) and in systems that interact with humans directly (e.g., chatbots). The development of systems capable of generating verifiably persuasive messages presents both opportunities and challenges for society. On the one hand, such systems could have a positive impact in domains like advertising and social good, such as combating drug addiction; on the other, they could be misused to spread misinformation and shape political opinions. To channel LLMs' impact on society, we need systems that measure and benchmark their persuasiveness. With this motivation, we introduce PersuasionBench and PersuasionArena, the first large-scale benchmark and arena containing a battery of tasks to automatically measure the persuasion ability of generative models. We investigate the extent to which LLMs know and leverage linguistic patterns that can help them generate more persuasive language. Our findings indicate that the persuasiveness of LLMs correlates positively with model size, but smaller models can also be trained to be more persuasive than much larger ones. Notably, targeted training on synthetic and natural datasets significantly enhances smaller models' persuasive capabilities, challenging scale-dependent assumptions. These findings carry key implications for both model developers and policymakers. For instance, while the EU AI Act and California's SB-1047 aim to regulate AI models based on the number of floating-point operations, we demonstrate that such simple metrics alone fail to capture the full scope of AI's societal impact. We invite the community to explore and contribute to PersuasionArena and PersuasionBench, available at https://bit.ly/measure-persuasion, to advance our understanding of AI-driven persuasion and its societal implications.