Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes one of scale: can we expand multilingual safety evaluations of these models at the same velocity at which they are deployed? To this end, we introduce RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is specifically designed to detect culturally-specific toxic language. We evaluate 10 S/LLMs on their ability to detect toxic content in a culturally-sensitive, multilingual scenario. We find that, although they typically score acceptably in terms of accuracy, they show low agreement with human judges when holistically scoring the toxicity of a prompt, and have difficulty discerning harm in context-dependent scenarios, particularly with subtle-yet-harmful content (e.g., microaggressions, bias). We release this dataset to help further reduce harmful uses of these models and to improve their safe deployment.