In the pursuit of developing Large Language Models (LLMs) that adhere to societal standards, it is imperative to detect toxicity in generated text. The majority of existing toxicity metrics rely on encoder models trained on specific toxicity datasets, which are susceptible to out-of-distribution (OOD) problems and depend on the dataset's definition of toxicity. In this paper, we introduce a robust metric grounded in LLMs that flexibly measures toxicity according to a given definition. We first analyze the toxicity factors, then examine the intrinsic toxic attributes of LLMs to ascertain their suitability as evaluators. Finally, we evaluate the performance of our metric with detailed analysis. Our empirical results demonstrate outstanding performance in measuring toxicity within verified factors, improving on conventional metrics by 12 points in F1 score. Our findings also indicate that upstream toxicity significantly influences downstream metrics, suggesting that LLMs are unsuitable for toxicity evaluations within unverified factors.