Existing cyberbullying detection benchmarks are organized by the polarity of speech, such as "offensive" and "non-offensive", which makes them essentially hate speech detection benchmarks. In the real world, however, cyberbullying often attracts widespread social attention through specific incidents. To address this gap, we propose a novel annotation method for constructing a cyberbullying dataset that is organized by incidents. The resulting dataset, CHNCI, is the first Chinese cyberbullying incident detection dataset and consists of 220,676 comments across 91 incidents. Specifically, we first combine three cyberbullying detection methods based on explanation generation into an ensemble that generates pseudo labels, and then have human annotators judge these labels. We then propose evaluation criteria for validating whether an incident constitutes cyberbullying. Experimental results demonstrate that the constructed dataset can serve as a benchmark for the tasks of cyberbullying detection and incident prediction. To the best of our knowledge, this is the first study of the Chinese cyberbullying incident detection task.
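The pseudo-labeling step described in the abstract (three detectors combined into an ensemble, with the results passed to human annotators) can be illustrated with a minimal sketch. The abstract does not say how the three methods are combined, so the majority vote used here is an assumption, as are the function names, the binary label encoding (1 for bullying, 0 otherwise), and the `unanimous` flag; the keyword-based detectors are hypothetical stand-ins for the actual explanation-generation methods.

```python
from collections import Counter, defaultdict


def majority_vote(votes):
    """Return the most frequent label among the detector votes."""
    return Counter(votes).most_common(1)[0][0]


def pseudo_label(comments, detectors):
    """Attach an ensemble pseudo label to each (incident_id, text) pair.

    Assumption: labels are binary and combined by majority vote.
    The 'unanimous' flag records whether all detectors agreed, which
    could help human annotators prioritize contested comments.
    """
    records = []
    for incident_id, text in comments:
        votes = [detector(text) for detector in detectors]
        records.append({
            "incident_id": incident_id,
            "text": text,
            "pseudo_label": majority_vote(votes),
            "unanimous": len(set(votes)) == 1,
        })
    return records


def group_by_incident(records):
    """Organize labeled comments by incident, mirroring the dataset layout."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["incident_id"]].append(record)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical stand-in detectors; the real methods are model-based.
    detectors = [
        lambda text: int("idiot" in text),
        lambda text: int("idiot" in text or "stupid" in text),
        lambda text: 0,
    ]
    comments = [("incident-1", "you idiot"), ("incident-1", "nice day")]
    for record in pseudo_label(comments, detectors):
        print(record)
```

With an odd number of binary detectors a majority always exists, so no tie-breaking rule is needed in this sketch; a different combination rule would change only `majority_vote`.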