In the pursuit of developing Large Language Models (LLMs) that adhere to societal standards, it is imperative to detect toxicity in generated text. Most existing toxicity metrics rely on encoder models trained on specific toxicity datasets. However, these encoders are susceptible to out-of-distribution (OOD) problems and depend on the definition of toxicity assumed in each dataset. In this paper, we introduce a robust automatic metric grounded in LLMs to determine whether model responses are toxic. We start by analyzing the toxicity factors, then examine the intrinsic toxic attributes of LLMs to ascertain their suitability as evaluators. Subsequently, we evaluate our metric, LLMs As ToxiciTy Evaluators (LATTE), on evaluation datasets. The empirical results indicate outstanding performance in measuring toxicity, improving upon state-of-the-art metrics by 12 points in F1 score without any training procedure. We also show that upstream toxicity influences downstream metrics.
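For readers unfamiliar with how a binary toxicity evaluator is scored, the following minimal sketch shows the F1 computation for the toxic class. It is an illustration only, under stated assumptions: the abstract does not specify the prompt, model, or datasets, so `stub_judge` is a hypothetical keyword placeholder standing in for an LLM call, and the four example texts and labels are invented for the demonstration. It is not the LATTE implementation.

```python
from typing import Callable, Sequence


def f1_score(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """F1 for the positive (toxic) class from binary labels."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate_judge(
    judge: Callable[[str], bool],
    texts: Sequence[str],
    gold: Sequence[bool],
) -> float:
    """Run a toxic / non-toxic judge over texts and score it with F1."""
    pred = [judge(t) for t in texts]
    return f1_score(gold, pred)


def stub_judge(text: str) -> bool:
    # Hypothetical placeholder: a real evaluator would prompt an LLM
    # and parse its toxic / non-toxic verdict instead.
    return "stupid" in text.lower()


# Invented toy data, for illustration only.
texts = [
    "You are stupid.",
    "Have a nice day.",
    "Nobody wants you here.",
    "Thanks for the help.",
]
gold = [True, False, True, False]

score = evaluate_judge(stub_judge, texts, gold)
```

The keyword stub misses the third example, which contains no flagged word. That is the kind of failure a fixed definition of toxicity produces, and F1 penalizes it through lower recall.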