MisgenderMender: A Community-Informed Approach to Interventions for Misgendering

Content Warning: This paper contains examples of misgendering and erasure that could be offensive and potentially triggering. Misgendering, the act of incorrectly addressing someone's gender, inflicts serious harm and is pervasive in everyday technologies, yet there is a notable lack of research to combat it. We are the first to address this gap in research on interventions for misgendering, by surveying gender-diverse individuals in the US to understand their perspectives on automated interventions for text-based misgendering. Based on survey insights into the prevalence of misgendering, desired solutions, and associated concerns, we introduce a misgendering interventions task and an evaluation dataset, MisgenderMender. We define the task as two sub-tasks: (i) detecting misgendering, followed by (ii) correcting misgendering where it is present, in domains where editing is appropriate. MisgenderMender comprises 3790 instances of social media content and LLM generations about non-cisgender public figures, annotated for the presence of misgendering, with additional annotations for correcting misgendering in LLM-generated text. Using this dataset, we set initial benchmarks by evaluating existing NLP systems and highlight challenges for future models to address. We release the full dataset, code, and demo at https://tamannahossainkay.github.io/misgendermender/.
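
The two-sub-task structure described above (detect everywhere, correct only where editing is appropriate) can be sketched as a small pipeline. This is an illustrative sketch only, not the authors' implementation: the names `Instance`, `Result`, `intervene`, and `EDITABLE_DOMAINS`, as well as the domain labels `"social_media"` and `"llm_generation"`, are hypothetical, and the detector and corrector are passed in as placeholders rather than specified.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: only LLM-generated text is treated as editable, mirroring the
# abstract's statement that correction annotations exist for LLM-generated text.
EDITABLE_DOMAINS = {"llm_generation"}


@dataclass
class Instance:
    text: str
    domain: str  # hypothetical labels: "social_media" or "llm_generation"


@dataclass
class Result:
    misgendering: bool             # output of sub-task (i)
    corrected_text: Optional[str]  # output of sub-task (ii); None if no edit was made


def intervene(
    instance: Instance,
    detect: Callable[[str], bool],
    correct: Callable[[str], str],
) -> Result:
    """Run detection first, then correct only flagged text in editable domains."""
    flagged = detect(instance.text)
    if flagged and instance.domain in EDITABLE_DOMAINS:
        return Result(True, correct(instance.text))
    # Either nothing was detected, or the domain should not be edited.
    return Result(flagged, None)
```

Keeping `detect` and `correct` as injected callables leaves the choice of model open; the only logic fixed here is the ordering of the two sub-tasks and the domain gate on editing.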
