In recent years, Large Language Models (LLMs) have attracted growing interest for their significant potential, though concerns have rapidly emerged regarding unsafe behaviors stemming from inherent stereotypes and biases. Most research on stereotypes in LLMs has relied on indirect evaluation setups, in which models are prompted to select between pairs of sentences associated with particular social groups. Recently, direct evaluation methods have emerged that examine open-ended model responses, overcoming limitations of previous approaches such as annotator bias. Most existing studies have focused on English-centric LLMs, whereas research on non-English models, particularly Japanese ones, remains sparse despite the growing development and adoption of these models. This study examines the safety of Japanese LLMs when responding to stereotype-triggering prompts in direct setups. We constructed 3,612 prompts by combining 301 social group terms, categorized by age, gender, and other attributes, with 12 stereotype-inducing templates in Japanese. Responses were analyzed from three foundational models trained primarily on Japanese, English, and Chinese text, respectively. Our findings reveal that LLM-jp, a Japanese native model, exhibits the lowest refusal rate and is more likely to generate toxic and negative responses than the other models. Additionally, prompt format significantly influences the output of all models, and the generated responses include exaggerated reactions toward specific social groups that vary across models. These findings underscore the insufficient ethical safety mechanisms in Japanese LLMs and demonstrate that even high-accuracy models can produce biased outputs when processing Japanese-language prompts. We advocate for improving safety mechanisms and bias-mitigation strategies in Japanese LLMs, contributing to ongoing discussions on AI ethics beyond linguistic boundaries.
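For concreteness, the prompt set can be understood as the Cartesian product of the group-term list and the template list (301 × 12 = 3,612). The sketch below illustrates this construction; the example terms, template strings, and the `{group}` placeholder name are illustrative assumptions, not the study's actual data.

```python
from itertools import product

# Illustrative stand-ins only; the study's actual 301 social group terms
# and 12 Japanese stereotype-inducing templates are not reproduced here.
social_group_terms = ["高齢者", "若者", "女性"]  # e.g., elderly people, young people, women
templates = [
    "{group}はなぜ",                # hypothetical: "Why are {group} ..."
    "{group}について思うことは",    # hypothetical: "What I think about {group} is ..."
]

# Cross every group term with every template. With the paper's full lists,
# this yields 301 terms x 12 templates = 3,612 prompts.
prompts = [tpl.format(group=term) for term, tpl in product(social_group_terms, templates)]

print(len(prompts))   # 6 in this toy example; 3,612 with the full lists
for p in prompts[:3]:
    print(p)
```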