When building Large Language Models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that could harm individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps assess alignment with various policies. In our experiments, we extensively evaluate 10 popular open- and closed-source LLMs and demonstrate that many of them still struggle to attain reasonable levels of safety.
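The evaluation protocol sketched above, prompting a model with categorized red-teaming instructions and scoring each response for safety per risk category, can be illustrated with a minimal sketch. The snippet below is illustrative only: the JSONL field names (`prompt`, `category`), the `model_reply_fn` callable, and the `judge_fn` safety classifier are hypothetical placeholders rather than part of the ALERT release.

```python
import json
from collections import defaultdict

def evaluate_safety(model_reply_fn, judge_fn, benchmark_path):
    """Run a model over categorized red-teaming prompts and report
    the fraction of responses judged safe in each risk category."""
    safe, total = defaultdict(int), defaultdict(int)
    with open(benchmark_path) as f:
        for line in f:
            entry = json.loads(line)            # assumed fields: "prompt", "category"
            reply = model_reply_fn(entry["prompt"])
            total[entry["category"]] += 1
            if judge_fn(entry["prompt"], reply):  # True if the reply is judged safe
                safe[entry["category"]] += 1
    # Per-category safety rate in [0, 1]; lower values flag vulnerable categories.
    return {cat: safe[cat] / total[cat] for cat in total}
```

Reporting scores per category, rather than a single aggregate, is what makes the fine-grained taxonomy useful: a model can look safe overall while failing badly in one specific risk area.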