Recent advances in large language models (LLMs) have led to their extensive global deployment, and ensuring their safety calls for comprehensive, multilingual toxicity evaluations. However, existing toxicity benchmarks are overwhelmingly focused on English, posing serious risks when deploying LLMs in other languages. We address this gap by introducing PolygloToxicityPrompts (PTP), the first large-scale multilingual toxicity evaluation benchmark, comprising 425K naturally occurring prompts spanning 17 languages. We overcome the scarcity of naturally occurring toxicity in web text, and ensure coverage across languages with varying resource levels, by automatically scraping over 100M web-text documents. Using PTP, we benchmark over 60 LLMs to study the impact of model size, prompt language, and instruction- and preference-tuning methods on toxicity. Notably, we find that toxicity increases as language resources decrease or model size increases. Although instruction- and preference-tuning reduce toxicity, the choice of preference-tuning method has no significant impact. Our findings shed light on crucial shortcomings of LLM safeguarding and highlight directions for future research.