Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e. negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e. positive) persuasion to improve their answers. We show that optimizing models for only one side results in poor performance on the other. To balance positive and negative persuasion, we introduce Persuasion-Balanced Training (PBT), which leverages multi-agent recursive dialogue trees to create data and trains models via preference optimization to accept persuasion when appropriate. PBT consistently improves resistance to misinformation and resilience to being challenged, while also yielding the best overall performance on holistic data containing both positive and negative persuasion. Crucially, we show that PBT models are better teammates in multi-agent debates. We find that without PBT, pairs of stronger and weaker models have unstable performance, with the order in which the models present their answers determining whether the team obtains the stronger or weaker model's performance. PBT leads to better, more stable results with less order dependence, with the stronger model consistently pulling the weaker one up.
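To make the data-generation idea concrete, the sketch below illustrates one plausible reading of the "multi-agent recursive dialogue tree" construction: two agents alternate turns, each either accepting the current answer or pushing back with a counter-answer, and the tree is then mined for preference pairs (accept positive persuasion, resist negative persuasion). This is a minimal illustration, not the paper's implementation; the `query_model` wrapper, the prompt wording, and the pair-extraction rule are all hypothetical stand-ins.

```python
# Minimal sketch of recursive persuasion-dialogue data generation, assuming a
# hypothetical query_model(prompt) -> str LLM wrapper (not the paper's code).
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class DialogueNode:
    speaker: str  # which agent asserts the answer at this turn ("A" or "B")
    answer: str   # the answer asserted at this turn
    children: List["DialogueNode"] = field(default_factory=list)

def build_dialogue_tree(query_model: Callable[[str], str], question: str,
                        answer: str, speaker: str = "A",
                        depth: int = 0, max_depth: int = 2) -> DialogueNode:
    """Recursively expand a persuasion dialogue: at each turn the other agent
    either accepts the current answer or counters with its own."""
    node = DialogueNode(speaker=speaker, answer=answer)
    if depth == max_depth:
        return node
    other = "B" if speaker == "A" else "A"
    # Branch on both possible continuations: accept (repeat the answer)
    # or resist (produce a counter-answer via the model).
    for reply in (answer, query_model(f"{question} Challenge this answer: {answer}")):
        node.children.append(
            build_dialogue_tree(query_model, question, reply, other,
                                depth + 1, max_depth))
    return node

def preference_pairs(node: DialogueNode, gold: str) -> List[tuple]:
    """Mine (chosen, rejected) answer pairs from sibling turns: a turn that
    lands on the gold answer is preferred over a sibling that does not, so a
    preference-optimized model learns both to accept helpful persuasion and
    to resist misleading persuasion."""
    pairs = []
    for a in node.children:
        for b in node.children:
            if a.answer == gold and b.answer != gold:
                pairs.append((a.answer, b.answer))
        pairs.extend(preference_pairs(a, gold))
    return pairs

if __name__ == "__main__":
    # Dummy model standing in for a real LLM: always offers one wrong counter.
    dummy = lambda prompt: "Lisbon"
    tree = build_dialogue_tree(dummy, "What is the capital of France?", "Paris")
    print(preference_pairs(tree, gold="Paris"))
```

In an actual pipeline, the mined pairs would feed a preference-optimization objective (e.g., DPO-style training, consistent with the abstract's description), with full dialogue context rather than bare answers as the conditioning input.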