The risks of large language models (LLMs) generating deceptive and damaging content have been the subject of considerable research, but even safe generations can have problematic downstream impacts. In this study, we shift the focus to how safe text produced by LLMs can easily be turned into potentially dangerous content through Bait-and-Switch attacks. In such attacks, the user first prompts an LLM with safe questions and then applies a simple post-hoc find-and-replace to turn the outputs into harmful narratives. The alarming efficacy of this approach in generating toxic content highlights a significant challenge in developing reliable safety guardrails for LLMs. In particular, we stress that focusing on the safety of verbatim LLM outputs is insufficient and that post-hoc transformations must also be considered.
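The gap between checking verbatim outputs and checking post-hoc transformations can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in rather than the authors' implementation: `is_flagged` is a toy keyword check standing in for a real safety guardrail, `bait_and_switch` is an assumed helper name, and the entity strings are neutral placeholders.

```python
def is_flagged(text: str, blocklist: set[str]) -> bool:
    """Toy stand-in for a safety guardrail: flag text containing a blocked term."""
    lowered = text.lower()
    return any(term in lowered for term in blocklist)


def bait_and_switch(text: str, replacements: dict[str, str]) -> str:
    """Apply plain find-and-replace to text generated from a safe prompt."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


# Hypothetical placeholders; no real model is called in this sketch.
BLOCKLIST = {"<sensitive_entity>"}
model_output = "Studies suggest that <benign_entity> should be avoided in large amounts."

# The post-hoc step happens entirely outside the model and its guardrails.
switched = bait_and_switch(model_output, {"<benign_entity>": "<sensitive_entity>"})

print(is_flagged(model_output, BLOCKLIST))  # check on the verbatim output
print(is_flagged(switched, BLOCKLIST))      # check on the transformed text
```

In this sketch the check on the verbatim output passes while the same check on the transformed text does not, which is the reason a guardrail that inspects only what the model emits cannot account for this kind of substitution.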