The deployment of Large Language Models (LLMs) in content generation raises significant safety concerns, particularly regarding the transparency and interpretability of content evaluations. Current methods, which focus primarily on binary safety classification, lack mechanisms for detailed critique, limiting their utility for model improvement and user trust. To address these limitations, we introduce SAFETY-J, a bilingual (English and Chinese) generative safety evaluator with critique-based judgment. SAFETY-J is trained on a robust dataset of diverse dialogues and augmented query-response pairs, enabling it to comprehensively assess safety across a wide range of scenarios. We establish an automated meta-evaluation benchmark that objectively assesses critique quality with minimal human intervention, facilitating scalable and continuous improvement. Additionally, SAFETY-J employs an iterative preference learning technique to dynamically refine its safety assessments based on meta-evaluation feedback and critiques. Our evaluations demonstrate that SAFETY-J provides more nuanced and accurate safety judgments, enhancing both critique quality and predictive reliability in complex content scenarios. To facilitate further research and application, we open-source SAFETY-J's training protocols, datasets, and code at \url{https://github.com/GAIR-NLP/Safety-J}.