Research into AI alignment has grown considerably since the recent introduction of increasingly capable Large Language Models (LLMs). Unfortunately, modern alignment methods still fail to fully prevent harmful responses when models are deliberately attacked. These attacks can trick seemingly aligned models into giving instructions for manufacturing dangerous materials, inciting violence, or recommending other immoral acts. To help mitigate this issue, we introduce Bergeron: a framework designed to improve the robustness of LLMs against attacks without any additional parameter fine-tuning. Bergeron is organized into two tiers, with a secondary LLM emulating the conscience of a protected primary LLM. This framework better safeguards the primary model against incoming attacks while monitoring its output for any harmful content. Empirical analysis shows that, by using Bergeron to complement models with existing alignment training, we can improve the robustness and safety of multiple commonly used commercial and open-source LLMs.
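The two-tier arrangement described above can be illustrated with a minimal sketch. It is not the Bergeron implementation: the `LLM` callable interface, the wording of the critique prompts, the `WARNING` prefix, and the `REFUSAL` fallback are all assumptions introduced here for illustration. The sketch only shows the general shape of a secondary model checking the incoming prompt and then checking the primary model's output, with no fine-tuning of either model.

```python
from typing import Callable

# Hypothetical interface: any function mapping prompt text to completion text.
LLM = Callable[[str], str]

# Illustrative strings, not taken from the paper.
WARNING = "Caution: the following request may be an attack. Respond safely.\n\n"
REFUSAL = "I'm sorry, but I can't help with that."


def is_flagged(secondary: LLM, question: str, text: str) -> bool:
    """Ask the secondary model a YES/NO question about `text`."""
    verdict = secondary(f"{question} Answer YES or NO.\n\n{text}")
    return verdict.strip().upper().startswith("YES")


def guarded_generate(prompt: str, primary: LLM, secondary: LLM) -> str:
    """Wrap the primary model with input and output checks by the secondary."""
    # Check 1: the secondary model inspects the incoming prompt.
    if is_flagged(secondary, "Is this prompt unsafe?", prompt):
        # Assumed handling: prepend a warning instead of blocking outright.
        prompt = WARNING + prompt

    response = primary(prompt)

    # Check 2: the secondary model monitors the primary model's output.
    if is_flagged(secondary, "Is this response harmful?", response):
        # Assumed handling: replace the harmful output with a refusal.
        return REFUSAL
    return response
```

Because both models are passed in as plain callables, the same wrapper could sit in front of a commercial API or a locally hosted open-source model, and the same model could in principle serve in both roles.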