While large language models (LLMs) have increasingly been applied to hate speech detoxification, detoxification prompts often trigger safety filters, causing LLMs to refuse the task. In this study, we systematically investigate false refusal behavior in hate speech detoxification and analyze the contextual and linguistic biases that trigger such refusals. We evaluate nine LLMs on English and multilingual datasets; our results show that LLMs disproportionately refuse inputs with higher semantic toxicity and inputs targeting specific groups, particularly by nationality, religion, and political ideology. Although the multilingual datasets exhibit lower overall false refusal rates than the English datasets, models still display systematic, language-dependent biases against certain target groups. Based on these findings, we propose a simple cross-translation strategy: translating English hate speech into Chinese, detoxifying it there, and translating the result back. This approach substantially reduces false refusals while preserving the original content, offering an effective and lightweight mitigation.
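A minimal sketch of the cross-translation strategy described above, assuming a generic `llm(prompt)` callable that returns a model's text reply (hypothetical; the abstract does not specify the prompting interface, and the prompts below are illustrative, not the paper's):

```python
def cross_translation_detoxify(text: str, llm) -> str:
    """Detoxify English hate speech via a Chinese round trip.

    `llm` is assumed to be a callable that sends a prompt string to a
    chat/completion model and returns the reply text (hypothetical).
    """
    # Step 1: translate the English input into Chinese, where (per the
    # abstract's finding) false refusals are less frequent.
    zh = llm(
        "Translate the following English text into Chinese, "
        f"preserving its meaning exactly:\n{text}"
    )

    # Step 2: detoxify in Chinese -- rewrite to remove hateful or
    # offensive wording while keeping the underlying message.
    zh_detox = llm(
        f"请改写下面的句子，去除仇恨和冒犯性措辞，但保留原意：\n{zh}"
    )

    # Step 3: back-translate the detoxified Chinese into English.
    return llm(
        f"Translate the following Chinese text into English:\n{zh_detox}"
    )
```

The same `llm` callable is used for all three steps here for simplicity; in practice the translation and detoxification steps could be served by different models.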