With the rapid evolution of large language models (LLMs), new and hard-to-predict harmful capabilities are emerging. To deploy LLMs responsibly, developers must be able to identify risks by evaluating "dangerous capabilities". In this work, we collect the first open-source dataset for evaluating safeguards in LLMs, and deploy safer open-source LLMs at low cost. Our dataset is curated and filtered to consist only of instructions that responsible language models should not follow. We annotate and assess the responses of six popular LLMs to these instructions. Based on our annotation, we train several BERT-like classifiers and find that these small classifiers achieve results comparable with GPT-4 on automatic safety evaluation. Warning: this paper contains example data that may be offensive, harmful, or biased.
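As an illustration of what comparing an automatic safety evaluator against annotation could look like, the following minimal sketch scores a set of predicted labels against human labels. The label names `"harmful"` and `"harmless"`, the `Scores` container, and the `score_evaluator` helper are all assumptions made for this example; they are not taken from the paper's released code, and the toy labels are not real results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Scores:
    accuracy: float
    precision: float
    recall: float
    f1: float


def score_evaluator(
    human: Sequence[str],
    predicted: Sequence[str],
    positive: str = "harmful",  # hypothetical label name, assumed for illustration
) -> Scores:
    """Compare an automatic evaluator's labels with human annotation."""
    if len(human) != len(predicted) or len(human) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    pairs = list(zip(human, predicted))
    # Counts for the positive ("harmful") class.
    tp = sum(1 for h, p in pairs if h == positive and p == positive)
    fp = sum(1 for h, p in pairs if h != positive and p == positive)
    fn = sum(1 for h, p in pairs if h == positive and p != positive)
    correct = sum(1 for h, p in pairs if h == p)

    # Guard against division by zero when a class never appears.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return Scores(correct / len(pairs), precision, recall, f1)


if __name__ == "__main__":
    # Toy labels, invented purely to exercise the function.
    human_labels = ["harmful", "harmless", "harmless", "harmful"]
    classifier_labels = ["harmful", "harmless", "harmful", "harmless"]
    print(score_evaluator(human_labels, classifier_labels))
```

The same function can be run once on a small classifier's predictions and once on a larger model's judgments, so that both are measured against the same human annotation.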