Dehumanization, a subtle yet harmful form of hate speech, involves denying individuals their human qualities and often results in violence against marginalized groups. Despite significant progress in Natural Language Processing across various domains, its application to detecting dehumanizing language remains limited, largely due to the scarcity of publicly available annotated data for this domain. This paper evaluates the performance of cutting-edge NLP models, including GPT-4, GPT-3.5, and LLAMA-2, in identifying dehumanizing language. Our findings reveal that while these models show promise, achieving 70\% accuracy in distinguishing dehumanizing language from broader hate speech, they also exhibit biases: they are over-sensitive, misclassifying other forms of hate speech as dehumanization for a specific subset of target groups, while more frequently failing to identify clear cases of dehumanization for other target groups. Moreover, leveraging one of the best-performing models, we automatically annotated a larger dataset for training more accessible models. However, our findings indicate that these models do not yet generate data of sufficiently high quality for this task.