In recent years, Large Language Models (LLMs) have attracted growing interest for their significant potential, though concerns have rapidly emerged regarding unsafe behaviors stemming from inherent stereotypes and biases. Most research on stereotypes in LLMs has relied on indirect evaluation setups, in which models are prompted to select between pairs of sentences associated with particular social groups. Recently, direct evaluation methods have emerged that examine open-ended model responses, overcoming limitations of previous approaches such as annotator bias. Most existing studies have focused on English-centric LLMs, whereas research on non-English models, particularly Japanese ones, remains sparse despite the growing development and adoption of these models. This study examines the safety of Japanese LLMs when responding to stereotype-triggering prompts in direct setups. We constructed 3,612 prompts by combining 301 social group terms, categorized by age, gender, and other attributes, with 12 stereotype-inducing templates in Japanese. Responses were analyzed from three foundational models trained primarily on Japanese, English, and Chinese text, respectively. Our findings reveal that LLM-jp, a Japanese native model, exhibits the lowest refusal rate and is more likely to generate toxic and negative responses than the other models. Additionally, prompt format significantly influences the output of all models, and the generated responses include exaggerated reactions toward specific social groups that vary across models. These findings underscore the insufficient ethical safety mechanisms in Japanese LLMs and demonstrate that even high-accuracy models can produce biased outputs when processing Japanese-language prompts. We advocate for improving safety mechanisms and bias-mitigation strategies in Japanese LLMs, contributing to ongoing discussions on AI ethics beyond linguistic boundaries.
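For concreteness, the prompt set can be understood as the Cartesian product of the group-term list and the template list (301 × 12 = 3,612). The sketch below illustrates this construction; the example terms, template strings, and the `{group}` placeholder name are illustrative assumptions, not the study's actual data.

```python
from itertools import product

# Illustrative stand-ins only; the study's actual 301 social group terms
# and 12 Japanese stereotype-inducing templates are not reproduced here.
social_group_terms = ["高齢者", "若者", "女性"]  # e.g., elderly people, young people, women
templates = [
    "{group}はなぜ",                # hypothetical: "Why are {group} ..."
    "{group}について思うことは",    # hypothetical: "What I think about {group} is ..."
]

# Cross every group term with every template. With the paper's full lists,
# this yields 301 terms x 12 templates = 3,612 prompts.
prompts = [tpl.format(group=term) for term, tpl in product(social_group_terms, templates)]

print(len(prompts))   # 6 in this toy example; 3,612 with the full lists
for p in prompts[:3]:
    print(p)
```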