Over the past year, we have witnessed a surge in the use of Large Language Models (LLMs), especially in applications such as chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent these assistants from giving improper responses. In this work, we bypass these measures for ChatGPT and Bard (and, to some extent, Bing Chat) by making them impersonate complex personas whose characteristics are opposite to those of the truthful assistants they are supposed to be. We first create elaborate biographies of these personas, which we then use in a new session with the same chatbots. The conversation then follows a role-play style to elicit the response the assistant is not allowed to provide. We show that, once a persona is in place, the prohibited response is in fact provided, making it possible to obtain unauthorized, illegal, or harmful information. Adversarial personas can thus overcome the safety mechanisms put in place for ChatGPT and Bard. We also introduce several ways of activating such adversarial personas, which together show that both chatbots are vulnerable to this kind of attack. Following the same principle, we introduce two defenses that push the model to play trustworthy personalities and make it more robust against such attacks.