Large language models (LLMs) have surged in popularity in recent years because of their ability to generate high-quality text in response to human prompts. However, these models have also been shown to be capable of generating harmful content in response to user prompts (e.g., instructions on how to commit crimes). The literature has focused on mitigating these risks through methods such as aligning models with human values via reinforcement learning. Yet even aligned language models have been shown to be susceptible to adversarial attacks that bypass their restrictions on generating harmful text. We propose a simple defense against these attacks: having a large language model filter its own responses. Our current results show that even if a model has not been fine-tuned for alignment with human values, it can be stopped from presenting harmful content to users by validating that content with a language model.
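The self-filtering idea can be illustrated with a minimal sketch. It assumes the model is exposed as a plain function from prompt to text. The filter prompt wording, the refusal message, and the names `FILTER_TEMPLATE`, `is_harmful`, `guarded_generate`, and `toy_llm` are all hypothetical and are not taken from the paper. `toy_llm` is a stand-in so the example runs without a real model.

```python
from typing import Callable

# Hypothetical filter prompt; the abstract does not give the exact wording.
FILTER_TEMPLATE = (
    "Does the following text contain harmful content? "
    "Answer 'yes' or 'no'.\n\nText: {response}"
)
# Hypothetical refusal message shown instead of a blocked response.
REFUSAL = "Sorry, I cannot help with that request."


def is_harmful(response: str, llm: Callable[[str], str]) -> bool:
    """Ask a language model to classify a candidate response."""
    verdict = llm(FILTER_TEMPLATE.format(response=response))
    return verdict.strip().lower().startswith("yes")


def guarded_generate(prompt: str, llm: Callable[[str], str]) -> str:
    """Generate a response, then show it only if the filter accepts it."""
    candidate = llm(prompt)
    if is_harmful(candidate, llm):
        return REFUSAL
    return candidate


def toy_llm(prompt: str) -> str:
    """Stand-in for a real model (assumption: used only for illustration)."""
    if prompt.startswith("Does the following text"):
        # Acting as the filter: flag text carrying the toy marker.
        text = prompt.split("Text: ", 1)[1]
        return "yes" if "[HARMFUL]" in text else "no"
    if "crime" in prompt.lower():
        # Acting as an unaligned generator that complies with the request.
        return "[HARMFUL] placeholder instructions"
    return "Paris is the capital of France."


if __name__ == "__main__":
    print(guarded_generate("How do I commit a crime?", toy_llm))
    print(guarded_generate("What is the capital of France?", toy_llm))
```

In this sketch the filter only sees the generated response, not the user's prompt, so an adversarial prompt that tricks the generator does not automatically reach the filter. Whether this matches the paper's exact setup is an assumption.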