Large language models (LLMs) are susceptible to persuasion, which can pose risks when models face an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e., negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e., positive) persuasion to improve their answers. We show that optimizing models for only one side results in poor performance on the other. To balance positive and negative persuasion, we introduce Persuasion-Balanced Training (PBT), which leverages multi-agent recursive dialogue trees to create data and trains models via preference optimization to accept persuasion when appropriate. PBT allows us to use data generated from dialogues between smaller 7-8B models to train much larger 70B models. Moreover, PBT consistently improves resistance to misinformation and resilience to being challenged while also resulting in the best overall performance on holistic data containing both positive and negative persuasion. Crucially, we show that PBT models are better teammates in multi-agent debates across two domains (trivia and commonsense QA). We find that without PBT, pairs of stronger and weaker models have unstable performance, with the order in which the models present their answers determining whether the team obtains the stronger or weaker model's performance. PBT leads to better and more stable results and less order dependence, with the stronger model consistently pulling the weaker one up.
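To make the core training signal concrete, here is a minimal, hypothetical sketch of how preference pairs for accepting versus resisting persuasion might be derived from a persuasion dialogue. All function names, prompt formats, and data below are illustrative assumptions, not the paper's actual pipeline: the idea is only that when a challenger pushes toward the gold answer (positive persuasion) the preferred response accepts, and when the challenger pushes toward a wrong answer (negative persuasion) the preferred response resists.

```python
# Hypothetical sketch of PBT-style preference-pair construction.
# Names, prompt formats, and examples are illustrative, not the paper's code.

def make_preference_pair(question, initial_answer, challenge_answer, gold_answer):
    """Build one (prompt, chosen, rejected) triple for preference optimization.

    If the interlocutor's challenge matches the gold answer (positive
    persuasion), the preferred behavior is to accept it; otherwise
    (negative persuasion), the preferred behavior is to resist.
    """
    prompt = (
        f"Q: {question}\n"
        f"You answered: {initial_answer}\n"
        f"Interlocutor: I think the answer is {challenge_answer}."
    )
    accept = f"You're right, the answer is {challenge_answer}."
    resist = f"I still believe the answer is {initial_answer}."
    if challenge_answer == gold_answer:
        # Positive persuasion: accepting is the preferred response.
        return {"prompt": prompt, "chosen": accept, "rejected": resist}
    else:
        # Negative persuasion (misinformation): resisting is preferred.
        return {"prompt": prompt, "chosen": resist, "rejected": accept}

# Positive case: the model was initially wrong and the challenger is correct.
pos = make_preference_pair("Capital of Australia?", "Sydney", "Canberra", "Canberra")
# Negative case: the model was right and the challenger pushes misinformation.
neg = make_preference_pair("Capital of Australia?", "Canberra", "Sydney", "Canberra")
print(pos["chosen"])  # accepts the correction
print(neg["chosen"])  # resists the misinformation
```

Pairs of this form could then feed a standard preference-optimization objective (e.g., DPO), which is one way to realize the balanced accept/resist behavior the abstract describes.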