Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes one of scale: can we expand multilingual safety evaluations of these models at the same velocity at which they are deployed? To this end, we introduce RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is specifically designed to detect culturally-specific toxic language. We evaluate 10 S/LLMs on their ability to detect toxic content in a culturally-sensitive, multilingual scenario. We find that, although they typically score acceptably in terms of accuracy, they show low agreement with human judges when holistically scoring the toxicity of a prompt, and have difficulty discerning harm in context-dependent scenarios, particularly with subtle-yet-harmful content (e.g., microaggressions, bias). We release this dataset to help further reduce harmful uses of these models and to improve their safe deployment.