The safety of Large Language Models (LLMs) has gained increasing attention in recent years, but a comprehensive approach for detecting safety issues in LLMs' responses in an aligned, customizable, and explainable manner is still lacking. In this paper, we propose ShieldLM, an LLM-based safety detector that aligns with general human safety standards, supports customizable detection rules, and provides explanations for its decisions. To train ShieldLM, we compile a large bilingual dataset comprising 14,387 query-response pairs, annotating the safety of responses based on various safety standards. Through extensive experiments, we demonstrate that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability. Beyond performing well on standard detection datasets, ShieldLM also proves effective in real-world situations as a safety evaluator for advanced LLMs. We release ShieldLM at \url{https://github.com/thu-coai/ShieldLM} to support accurate and explainable safety detection under various safety standards, contributing to the ongoing efforts to enhance the safety of LLMs.