Cyberbullying on social media is inherently multilingual and multi-faceted, and abusive behaviors often overlap across multiple categories. Existing methods are commonly limited by monolingual assumptions or single-task formulations, which restrict their effectiveness in realistic multilingual, multi-label scenarios. In this paper, we propose HMS-BERT, a hybrid multi-task self-training framework for multilingual and multi-label cyberbullying detection. Built on a pretrained multilingual BERT backbone, HMS-BERT integrates contextual representations with handcrafted linguistic features and jointly optimizes a fine-grained multi-label abuse classification task and a three-class main classification task. To address labeled-data scarcity in low-resource languages, an iterative self-training strategy with confidence-based pseudo-labeling is introduced to facilitate cross-lingual knowledge transfer. Experiments on four public datasets demonstrate that HMS-BERT achieves strong performance, attaining a macro F1-score of up to 0.9847 on the multi-label task and an accuracy of 0.6775 on the main classification task. Ablation studies further verify the effectiveness of the proposed components.
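The iterative self-training strategy with confidence-based pseudo-labeling can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the function names (`select_pseudo_labels`, `self_train`, `predict_proba`, `train_fn`), the 0.9 confidence threshold, and the fixed round count are all illustrative assumptions.

```python
def select_pseudo_labels(unlabeled, predict_proba, threshold=0.9):
    """Keep only unlabeled texts whose top-class confidence exceeds the threshold.

    `predict_proba` is assumed to map a text to a dict of class probabilities,
    e.g. {"bully": 0.95, "neutral": 0.05}. The 0.9 threshold is illustrative.
    """
    pseudo = []
    for text in unlabeled:
        probs = predict_proba(text)
        label, conf = max(probs.items(), key=lambda kv: kv[1])
        if conf >= threshold:
            pseudo.append((text, label))  # treat as labeled in the next round
    return pseudo


def self_train(labeled, unlabeled, train_fn, rounds=3, threshold=0.9):
    """Iteratively grow the labeled set with high-confidence pseudo-labels.

    `train_fn` is assumed to retrain the model on the current labeled set and
    return a fresh `predict_proba` function; in the paper this would be the
    HMS-BERT model, enabling cross-lingual transfer as confident predictions
    on low-resource languages are folded back into training.
    """
    for _ in range(rounds):
        predict_proba = train_fn(labeled)          # retrain on current labels
        new = select_pseudo_labels(unlabeled, predict_proba, threshold)
        if not new:                                # stop when nothing confident remains
            break
        labeled = labeled + new
        accepted = {text for text, _ in new}
        unlabeled = [t for t in unlabeled if t not in accepted]
    return labeled
```

The loop stops early when no prediction clears the threshold, which prevents low-confidence (and likely noisy) pseudo-labels from accumulating across rounds.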