MetaHate: A Dataset for Unifying Efforts on Hate Speech Detection

Hate speech is a pervasive and detrimental form of online discourse, often manifested through an array of slurs in content ranging from hateful tweets to defamatory posts. As such speech proliferates, it connects people globally while posing significant social, psychological, and occasionally physical threats to targeted individuals and communities. Current computational linguistic approaches for tackling this phenomenon rely on labelled social media datasets for training. To unify these efforts, our study addresses the critical need for a comprehensive meta-collection, advocating for an extensive dataset to help counteract this problem effectively. We scrutinized over 60 datasets, selectively integrating the pertinent ones into MetaHate. This paper offers a detailed examination of existing collections, highlighting their strengths and limitations. Our findings contribute to a deeper understanding of the existing datasets, paving the way for training more robust and adaptable models. These enhanced models are essential for effectively combating the dynamic and complex nature of hate speech in the digital realm.
