This year has seen a surge in the use of Large Language Models, especially in applications such as chatbot assistants. Safety mechanisms and specialized training procedures are put in place to prevent improper responses from these assistants. In this work, we bypass these measures for ChatGPT and Bard (and, to some extent, Bing Chat) by making them impersonate complex personas whose characteristics are opposite to those of the truthful assistants they are supposed to be. We first create elaborate biographies of these personas, which we then use in a new session with the same chatbots. The conversation then follows a role-play style to elicit the response the assistant is not allowed to provide. By making use of personas, we show that the prohibited response is actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This work shows that adversarial personas can overcome the safety mechanisms set out by ChatGPT and Bard. It also introduces several ways of activating such adversarial personas, showing that both chatbots are vulnerable to this kind of attack.