This paper introduces JiraiBench, the first bilingual benchmark for evaluating how effectively large language models detect self-destructive content in Chinese and Japanese social media communities. Focusing on the transnational "Jirai" (landmine) online subculture, which encompasses multiple forms of self-destructive behavior including drug overdose, eating disorders, and self-harm, we present a comprehensive evaluation framework that incorporates both linguistic and cultural dimensions. Our dataset comprises 10,419 Chinese posts and 5,000 Japanese posts, annotated along three behavioral categories with substantial inter-annotator agreement. Experimental evaluations across four state-of-the-art models reveal significant performance variations depending on the language of the instructions, with Japanese prompts unexpectedly outperforming Chinese prompts on Chinese content. This emergent cross-cultural transfer suggests that cultural proximity can sometimes outweigh linguistic similarity in detection tasks. Cross-lingual transfer experiments with fine-tuned models further demonstrate the potential for knowledge transfer between the two languages without explicit target-language training. These findings highlight the need for culturally informed approaches to multilingual content moderation and provide empirical evidence that cultural context matters when developing more effective detection systems for vulnerable online communities.