Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias

Content moderation on social media platforms shapes the dynamics of online discourse, influencing whose voices are amplified and whose are suppressed. Recent studies have raised concerns about the fairness of content moderation practices, particularly the aggressive flagging of posts from transgender and non-binary individuals as toxic. In this study, we investigate bias in the harmful speech classification of gender-queer dialect online, focusing specifically on the treatment of reclaimed slurs. We introduce a novel dataset, QueerReclaimLex, based on 109 curated templates exemplifying non-derogatory uses of LGBTQ+ slurs. Dataset instances are scored by gender-queer annotators for potential harm, depending on additional context about speaker identity. We systematically evaluate the performance of five off-the-shelf language models in assessing the harm of these texts and explore the effectiveness of chain-of-thought prompting in teaching large language models (LLMs) to leverage author identity context. We find that these models tend to inaccurately flag texts authored by gender-queer individuals as harmful. Strikingly, across all LLMs, performance is poorest for texts that show signs of being written by individuals targeted by the featured slur (F1 ≤ 0.24). We highlight an urgent need for fairness and inclusivity in content moderation systems. By uncovering these biases, this work aims to inform the development of more equitable content moderation practices and contribute to the creation of inclusive online spaces for all users.
