Uncovering Political Hate Speech During Indian Election Campaign: A New Low-Resource Dataset and Baselines

Detecting hate speech in political discourse is a critical problem, and it is even more challenging in low-resource languages. To address it, we introduce IEHate, a new dataset of 11,457 manually annotated Hindi tweets related to the Indian Assembly Election Campaign from November 1, 2021, to March 9, 2022. We analyze the dataset in detail, focusing on the prevalence of hate speech in political communication and the different forms of hateful language used. We also benchmark the dataset using a range of machine learning, deep learning, and transformer-based algorithms. Our experiments show that these models leave room for improvement, highlighting the need for more advanced hate speech detection techniques in low-resource languages. In particular, human evaluation scores higher than the algorithms, which emphasizes the importance of combining human and automated approaches for effective hate speech moderation. IEHate can serve as a valuable resource for researchers and practitioners developing and evaluating hate speech detection techniques in low-resource languages. Overall, our work underscores the importance of addressing the challenges of identifying and mitigating hate speech in political discourse, particularly in the context of low-resource languages. The dataset and resources for this work are made available at https://github.com/Farhan-jafri/Indian-Election.
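As a rough illustration of how human and model predictions might be scored against the same gold labels, the sketch below computes a macro-averaged F1 in plain Python. The abstract does not say which metric was used, so macro-F1 is an assumption here, and the function name `macro_f1`, the label strings, and the toy label lists are all hypothetical and not drawn from IEHate.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Macro-averaged F1 over every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy labels invented for illustration only (assumption, not IEHate data).
gold = ["hate", "hate", "non-hate", "non-hate", "hate", "non-hate"]
human = ["hate", "hate", "non-hate", "non-hate", "hate", "hate"]
model = ["hate", "non-hate", "non-hate", "non-hate", "non-hate", "hate"]

print(f"human macro-F1: {macro_f1(gold, human):.3f}")
print(f"model macro-F1: {macro_f1(gold, model):.3f}")
```

Macro-averaging weights each class equally, which can matter when hateful tweets are a minority of the data. Whether that matches the paper's evaluation protocol should be checked against the paper itself.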

