Identifying offensive content in social media is vital for creating safe online communities. Several recent studies have addressed this problem by creating datasets for various languages. In this paper, we explore offensive language identification in texts with transliteration and code-mixing, linguistic phenomena that are common in multilingual societies and a known challenge for NLP systems. We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments. We train and fine-tune machine learning models on TB-OLID and evaluate their performance on this dataset. Our results show that English pre-trained transformer-based models, such as fBERT and HateBERT, achieve the best performance on this dataset.