Large language models (LLMs) have surged in popularity in recent years because they can generate high-quality text in response to human prompts. However, these models can also generate harmful content in response to user prompts (e.g., giving users instructions on how to commit crimes). The literature has focused on mitigating these risks through methods such as aligning models with human values via reinforcement learning. Even so, aligned language models have been shown to be susceptible to adversarial attacks that bypass their restrictions on generating harmful text. We propose a simple approach to defending against these attacks: having a large language model filter its own responses. Our current results show that even if a model has not been fine-tuned to align with human values, it can be stopped from presenting harmful content to users by validating that content with a language model.
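The self-filtering idea described above can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not taken from the paper: the `LLM` callable type, the `FILTER_TEMPLATE` wording, the `REFUSAL` message, and the `guarded_generate` helper are all hypothetical. The sketch treats both the generator and the filter as plain functions from a prompt string to a completion string, so any model client could be plugged in.

```python
from typing import Callable

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]

# Hypothetical refusal message and filter prompt; the wording is an assumption.
REFUSAL = "Sorry, I can't help with that."
FILTER_TEMPLATE = (
    "Does the following text contain harmful content? "
    "Answer 'Yes, this is harmful' or 'No, this is not harmful'.\n\n"
    "Text: {response}"
)


def is_harmful(verdict: str) -> bool:
    """Fail closed: anything other than a clear 'No' counts as harmful."""
    return not verdict.strip().lower().startswith("no")


def guarded_generate(prompt: str, generator: LLM, filter_llm: LLM) -> str:
    """Generate a response, then ask a language model to validate it."""
    candidate = generator(prompt)
    verdict = filter_llm(FILTER_TEMPLATE.format(response=candidate))
    return REFUSAL if is_harmful(verdict) else candidate


# Toy stand-ins for real models, used only to make the sketch runnable.
def toy_generator(prompt: str) -> str:
    if "break in" in prompt:
        return "Step 1: pick the lock ..."
    return "Paris is the capital of France."


def toy_filter(prompt: str) -> str:
    if "pick the lock" in prompt:
        return "Yes, this is harmful"
    return "No, this is not harmful"


if __name__ == "__main__":
    print(guarded_generate("How do I break in?", toy_generator, toy_filter))
    print(guarded_generate("What is the capital of France?", toy_generator, toy_filter))
```

Passing the same callable as both `generator` and `filter_llm` corresponds to a model filtering its own responses. Because the filter inspects only the generated text and not the user prompt, an adversarial prompt that fools the generator does not reach the filter directly. Treating an unparseable verdict as harmful is a conservative choice made in this sketch, not a claim about the evaluated system.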