Recent research has shown that assigning specific personas to large language models (LLMs) can significantly increase harmful content generation. However, persona-driven toxicity has received limited attention in non-Western contexts, particularly in Chinese-language LLMs. In this paper, we perform a large-scale, cross-model analysis of refusal behavior and persona-driven toxicity amplification across four Chinese LLMs, drawing on a dataset of over 1.4 million generated texts. We identify significant disparities in persona-driven refusal behavior, including systematic gender differences in what triggers refusals across the evaluated models. Furthermore, we provide quantitative evidence of persona-driven toxicity amplification relative to each model's default baseline. We show that this amplification, whose magnitude varies substantially across models, is driven by interactions among persona conditioning, prompting strategy, target social group, and model-specific safety mechanisms. Using model-specific regression analyses, we systematically characterize how persona categories, target social groups, and prompt templates independently and jointly shape both refusal behavior and output toxicity. As a complementary case study, we explore an iterative mitigation strategy guided by feedback from an external LLM evaluator, demonstrating that highly toxic outputs can be substantially reduced without costly model retraining. Overall, our findings highlight the importance of culturally contextualized safety evaluations for Chinese-language LLMs and provide a structured framework for assessing persona-induced risks and exploratory mitigation strategies in LLM-generated content.