The deployment of Large Language Models (LLMs) in content generation raises significant safety concerns, particularly regarding the transparency and interpretability of content evaluations. Current methods, which focus primarily on binary safety classification, lack mechanisms for detailed critique, limiting their utility for model improvement and user trust. To address these limitations, we introduce SAFETY-J, a bilingual (English and Chinese) generative safety evaluator with critique-based judgment. SAFETY-J is trained on a robust dataset of diverse dialogues and augmented query-response pairs, enabling it to comprehensively assess safety across a wide range of scenarios. We establish an automated meta-evaluation benchmark that objectively assesses critique quality with minimal human intervention, facilitating scalable and continuous improvement. Additionally, SAFETY-J employs an iterative preference learning technique to dynamically refine its safety assessments based on meta-evaluation feedback and critiques. Our evaluations demonstrate that SAFETY-J provides more nuanced and accurate safety judgments, enhancing both critique quality and predictive reliability in complex content scenarios. To facilitate further research and application, we open-source SAFETY-J's training protocols, datasets, and code at \url{https://github.com/GAIR-NLP/Safety-J}.