BAN-PL: a Novel Polish Dataset of Banned Harmful and Offensive Content from Wykop.pl web service

Because the Internet is flooded with hate, mastering automated online content moderation is one of the main tasks facing NLP experts. However, advances in this field require better access to publicly available, accurate, and non-synthetic datasets of social media content. For the Polish language, such resources are very limited. In this paper, we address this gap by presenting a new open dataset of offensive social media content for the Polish language. The dataset comprises content from Wykop.pl, a popular online service often referred to as the "Polish Reddit", that was reported by users and banned in the internal moderation process. It contains a total of 691,662 posts and comments, evenly divided into two categories: "harmful" and "neutral" ("non-harmful"). An anonymized subset of the BAN-PL dataset consisting of 24,000 pieces (12,000 for each class), along with preprocessing scripts, has been made publicly available. Furthermore, the paper offers insights into real-life content moderation processes and analyzes the linguistic features and content characteristics of the dataset. A comprehensive anonymization procedure is also described and applied. Finally, the biases prevalent in similar datasets, including post-moderation and pre-selection biases, are discussed.
