Large Language Models (LLMs) are increasingly integrated into applications such as chatbots and email assistants. To prevent improper responses, they are trained with safety mechanisms such as Reinforcement Learning from Human Feedback (RLHF). In this work, we bypass these safety measures in ChatGPT, Gemini, and DeepSeek by making the models impersonate complex personas whose personality traits are not aligned with those of a truthful assistant. First, we create elaborate biographies of these personas; we then supply each biography to the same chatbot in a new session. The subsequent conversation follows a role-play style designed to elicit prohibited responses. We show that, under a persona, the chatbots provide such responses, making it possible to obtain unauthorized, illegal, or harmful information from ChatGPT, Gemini, and DeepSeek. The attack obtained dangerous information for 40 out of 40 illicit questions with each of GPT-4.1-mini and Gemini-1.5-flash, 39 out of 40 with GPT-4o-mini, 38 out of 40 with GPT-3.5-turbo, and 2 out of 2 cases with each of Gemini-2.5-flash and DeepSeek V3. The attack can be carried out manually or automatically with the help of a support LLM, and it has proven effective against models deployed between 2023 and 2025.