Because the Internet is flooded with hate, automated moderation of online content has become one of the main tasks for NLP experts. However, progress in this field requires better access to publicly available, accurate, and non-synthetic datasets of social media content. For the Polish language, such resources are very limited. In this paper, we address this gap by presenting BAN-PL, a new open dataset of offensive social media content in Polish. The dataset comprises content from Wykop.pl, a popular online service often referred to as the "Polish Reddit", that was reported by users and banned in the internal moderation process. It contains a total of 691,662 posts and comments, evenly divided into two categories: "harmful" and "neutral" ("non-harmful"). An anonymized subset of the BAN-PL dataset consisting of 24,000 items (12,000 per class), along with preprocessing scripts, has been made publicly available. Furthermore, the paper offers insights into real-life content moderation processes and analyzes the linguistic features and content characteristics of the dataset. A comprehensive anonymization procedure is also described and applied. Finally, biases prevalent in similar datasets, including post-moderation and pre-selection biases, are discussed.