Detecting hate speech in political discourse is a critical problem, and it is even more challenging in low-resource languages. To address it, we introduce IEHate, a new dataset of 11,457 manually annotated Hindi tweets related to the Indian Assembly Election Campaign from November 1, 2021, to March 9, 2022. We analyze the dataset in detail, focusing on the prevalence of hate speech in political communication and the different forms of hateful language used. We also benchmark the dataset using a range of machine learning, deep learning, and transformer-based algorithms. Our experiments show that these models leave room for improvement, highlighting the need for more advanced hate speech detection techniques in low-resource languages. In particular, human evaluation scores relatively higher than the algorithms, which emphasizes the importance of combining human and automated approaches for effective hate speech moderation. IEHate can serve as a valuable resource for researchers and practitioners developing and evaluating hate speech detection techniques in low-resource languages. Overall, our work underscores the importance of identifying and mitigating hate speech in political discourse, particularly in low-resource languages. The dataset and resources for this work are available at https://github.com/Farhan-jafri/Indian-Election.