The prevalence of digital media and evolving sociopolitical dynamics have significantly amplified the dissemination of hateful content. Existing studies mainly focus on classifying texts into binary categories, often overlooking the continuous spectrum of offensiveness and hatefulness inherent in the text. In this research, we present an extensive benchmark dataset for Amharic, comprising 8,258 tweets annotated for three distinct tasks: category classification, identification of hate targets, and rating of offensiveness and hatefulness intensities. Our study shows that a considerable majority of tweets fall into the lower offensiveness and hate intensity levels, underscoring the need for early interventions by stakeholders. Ethnic and political hate targets are prevalent and overlap significantly in our dataset, emphasizing the complex relationships within Ethiopia's sociopolitical landscape. We build classification and regression models and investigate their efficacy in handling these tasks. Our results reveal that hate and offensive speech cannot be addressed by a simplistic binary classification; instead, they manifest as variables across a continuous range of values. The Afro-XLMR-large model performs best, achieving F1-scores of 75.30%, 70.59%, and 29.42% for the category, target, and regression tasks, respectively. Its correlation coefficient of 80.22% indicates strong alignment.