Current research on hate speech analysis typically focuses on monolingual, single-task classification. In this paper, we present a new multilingual hate speech analysis dataset covering six languages (English, Hindi, Arabic, French, German and Spanish) and five broad domains of hate speech: Abuse, Racism, Sexism, Religious Hate and Extremism. To the best of our knowledge, this paper is the first to address the problem of identifying various types of hate speech across these five domains in these six languages. We describe how we created the dataset, how we annotated it at both a high level and a low level for the different domains, and how we use it to test current state-of-the-art multilingual and multitask learning approaches. We evaluate our dataset in various monolingual, cross-lingual and machine translation classification settings and compare it against open source English datasets that we aggregated and merged for this task. Finally, we discuss how this approach can be used to create large-scale hate speech datasets and how our annotations can be leveraged to improve hate speech detection and classification in general.