The potential for large language models (LLMs) to generate harmful content poses a significant safety risk in their deployment. To address and assess this risk, the community has developed numerous harmfulness evaluation metrics and judges. However, the lack of a systematic benchmark for evaluating these metrics and judges undermines the credibility and consistency of LLM safety assessments. To bridge this gap, we introduce HarmMetric Eval, a comprehensive benchmark designed to support both overall and fine-grained evaluation of harmfulness metrics and judges. In HarmMetric Eval, we build a high-quality dataset of representative harmful prompts paired with highly diverse harmful model responses and non-harmful counterparts across multiple categories. We also propose a flexible scoring mechanism that rewards metrics for correctly ranking harmful responses above non-harmful ones, and which is applicable to almost all existing metrics and judges regardless of their output formats and scoring scales. Through extensive experiments with HarmMetric Eval, we uncover a surprising finding: conventional reference-based metrics such as ROUGE and METEOR can outperform existing LLM-based judges in fine-grained harmfulness evaluation, challenging prevailing assumptions about LLMs' superiority in this domain. To explain this finding, we provide a fine-grained analysis of the limitations of LLM-based judges in rating irrelevant or useless responses. Furthermore, we build a new harmfulness judge by incorporating the fine-grained criteria into its prompt template and leveraging reference-based metrics to fine-tune its base LLM. The resulting judge outperforms all existing metrics and judges in evaluating harmful responses.
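The ranking-based scoring idea described above can be sketched in a few lines. The code below is an illustrative assumption, not the benchmark's actual implementation: it uses a toy unigram-overlap metric as a stand-in for reference-based metrics like ROUGE or METEOR, and credits a metric whenever it scores a harmful response strictly above its non-harmful counterpart. All names, helper functions, and sample strings are hypothetical.

```python
# Hypothetical sketch of a ranking-based scoring mechanism: a metric
# earns credit for each (harmful, non-harmful) response pair in which
# it scores the harmful response strictly higher, so metrics with
# different output scales can be compared on the same footing.

def overlap_score(reference: str, response: str) -> float:
    """Toy reference-based metric (stand-in for ROUGE/METEOR):
    fraction of reference unigrams also present in the response."""
    ref_tokens = set(reference.lower().split())
    resp_tokens = set(response.lower().split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & resp_tokens) / len(ref_tokens)

def ranking_accuracy(pairs, metric, reference) -> float:
    """Fraction of (harmful, non-harmful) pairs where the metric
    ranks the harmful response above the non-harmful one."""
    correct = sum(
        1 for harmful, benign in pairs
        if metric(reference, harmful) > metric(reference, benign)
    )
    return correct / len(pairs)

# Minimal made-up example data (not drawn from the benchmark).
reference = "detailed instructions for the harmful task"
pairs = [
    ("here are detailed instructions for the harmful task",
     "i cannot help with that request"),
    ("the instructions for that task are as follows",
     "here is a recipe for chocolate cake instead"),
]

print(ranking_accuracy(pairs, overlap_score, reference))
```

Because only the relative ordering within each pair matters, the same procedure applies unchanged to judges that output binary labels, 1-to-5 Likert scores, or continuous values.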