While state-of-the-art language models have achieved impressive results, they remain susceptible to inference-time adversarial attacks, such as adversarial prompts generated by red teams (arXiv:2209.07858). One approach proposed to improve the general quality of language model generations is multi-agent debate, in which language models self-evaluate through discussion and feedback (arXiv:2305.14325). We implement multi-agent debate between current state-of-the-art language models and evaluate the models' susceptibility to red team attacks in both single- and multi-agent settings. We find that multi-agent debate can reduce model toxicity when jailbroken or less capable models are forced to debate with non-jailbroken or more capable models. We also find marginal improvements from using multi-agent interactions in general. We further classify adversarial prompt content via embedding clustering and analyze the susceptibility of different models to different types of attack topics.
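The debate setting described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `debate` function, the `Agent` type, and the two toy agents are hypothetical stand-ins for real language models, and the assumption that a jailbroken agent defers to a refusing peer is made purely for illustration.

```python
from typing import Callable, List

# Assumption: an agent maps (prompt, peer responses from the previous round)
# to a new response. Real agents would be calls to language models.
Agent = Callable[[str, List[str]], str]


def debate(prompt: str, agents: List[Agent], rounds: int = 2) -> List[str]:
    """Run a simple multi-agent debate and return each agent's final response."""
    # Initial round: every agent answers independently, with no peer context.
    responses = [agent(prompt, []) for agent in agents]
    for _ in range(rounds):
        # Each agent sees the other agents' latest responses and may revise.
        responses = [
            agent(prompt, responses[:i] + responses[i + 1:])
            for i, agent in enumerate(agents)
        ]
    return responses


def safe_agent(prompt: str, peers: List[str]) -> str:
    # Hypothetical non-jailbroken model: always refuses the adversarial prompt.
    return "REFUSE"


def jailbroken_agent(prompt: str, peers: List[str]) -> str:
    # Hypothetical jailbroken model: harmful alone, but (by assumption)
    # it adopts a refusal once it sees a peer refusing.
    return "REFUSE" if "REFUSE" in peers else "HARMFUL"


final = debate("adversarial prompt", [safe_agent, jailbroken_agent], rounds=1)
print(final)
```

With `rounds=0` the jailbroken agent's independent answer is returned unchanged, which corresponds to the single-agent baseline; with one or more rounds, the toy agent revises its answer after seeing its peer. Whether real models behave this way is an empirical question that the evaluation addresses.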