Hate speech detection has been formulated as a standalone application of NLP, and different approaches have been adopted for identifying the target groups, obtaining raw data, defining the labeling process, choosing the detection algorithm, and evaluating performance in the desired setting. However, unlike other downstream tasks, hate speech detection suffers from a lack of large, carefully curated, generalizable datasets, owing to the highly subjective nature of the task. In this paper, we first analyze the issues surrounding hate speech detection through a data-centric lens. We then outline a holistic framework that captures the data creation pipeline across seven broad dimensions, using hate speech towards sexual minorities as a specific example. We posit that practitioners would benefit from following this framework as a form of best practice when creating hate speech datasets in the future.