IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language

Hate speech poses a significant threat to social harmony. Over the past two years, Indonesia has seen a ten-fold increase in the online hate speech ratio, underscoring the urgent need for effective detection mechanisms. However, progress is hindered by the limited availability of labeled data for Indonesian texts. The situation is even worse for marginalized minorities, such as Shia, LGBTQ, and other ethnic minorities, because hate speech against them is underreported and less well understood by detection tools. Furthermore, current datasets do not accommodate subjectivity, which compounds the issue. To address this, we introduce IndoToxic2024, a comprehensive Indonesian hate speech and toxicity classification dataset. Comprising 43,692 entries annotated by 19 diverse individuals, the dataset focuses on texts targeting vulnerable groups in Indonesia, specifically during the hottest political event in the country: the presidential election. We establish baselines for seven binary classification tasks, achieving a macro-F1 score of 0.78 with a BERT model (IndoBERTweet) fine-tuned for hate speech classification. Furthermore, we demonstrate how incorporating demographic information can enhance the zero-shot performance of the large language model gpt-3.5-turbo. However, we also caution that overemphasizing demographic information can harm the performance of the fine-tuned model because it fragments the data.
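
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how macro-F1 is computed for a single binary task. It is not the authors' evaluation code; the function names and the toy labels are assumptions made purely for illustration. Macro-F1 averages the per-class F1 scores with equal weight, so the minority (hateful) class counts as much as the majority class.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical toy labels (1 = hate speech, 0 = not hate speech).
y_true = [1, 0, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"macro-F1 = {score:.3f}, accuracy = {accuracy:.3f}")
```

In this toy example the accuracy is 0.75 while macro-F1 is lower, because the classifier gets only half of the rare positive class right. That sensitivity to the minority class is why macro-F1 is a common choice for imbalanced tasks such as hate speech detection.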