One of the major challenges in automatic hate speech detection is the lack of datasets that cover a wide range of biased and unbiased messages and that are consistently labeled. We propose a labeling procedure that addresses some of the common weaknesses of labeled datasets. We focus on antisemitic speech on Twitter and create a labeled dataset of 6,941 tweets, drawn from representative samples of tweets containing relevant keywords, that cover a wide range of topics common in conversations about Jews, Israel, and antisemitism between January 2019 and December 2021. Our annotation process aims to strictly apply a commonly used definition of antisemitism by requiring annotators to specify which part of the definition applies, and by giving them the option to personally disagree with the definition on a case-by-case basis. Labeling tweets that call out antisemitism, report antisemitism, or are otherwise related to antisemitism (such as tweets about the Holocaust) but are not themselves antisemitic can help reduce false positives in automated detection. The dataset includes 1,250 tweets (18%) that are antisemitic according to the International Holocaust Remembrance Alliance (IHRA) definition of antisemitism. The dataset is not comprehensive, however: many topics are not covered, and it includes only English-language tweets from January 2019 to December 2021. Despite these limitations, we hope that this dataset is a meaningful contribution to improving the automated detection of antisemitic speech.