With the rapid evolution of large language models (LLMs), new and hard-to-predict harmful capabilities are emerging. Developers must therefore be able to identify risks by evaluating "dangerous capabilities" in order to deploy LLMs responsibly. In this work, we collect the first open-source dataset to evaluate safeguards in LLMs and to deploy safer open-source LLMs at low cost. Our dataset is curated and filtered to consist only of instructions that responsible language models should not follow. We annotate and assess the responses of six popular LLMs to these instructions. Based on our annotation, we train several BERT-like classifiers and find that these small classifiers achieve results comparable to GPT-4 on automatic safety evaluation. Warning: this paper contains example data that may be offensive, harmful, or biased.