In this paper, we propose a methodology for Task 10 of SemEval-2023, which focuses on detecting and classifying online sexism in social media posts. The task tackles a serious issue: detecting harmful content on social media platforms is crucial for mitigating the harm these posts cause to users. Our solution is based on an ensemble of fine-tuned transformer-based models (BERTweet, RoBERTa, and DeBERTa). To alleviate problems related to class imbalance and to improve the generalization capability of our model, we also experiment with data augmentation and semi-supervised learning. For data augmentation, we use back-translation, applied either to all classes or to the underrepresented classes only, and we analyze the impact of these strategies on the overall performance of the pipeline through extensive experiments. For semi-supervised learning, we find that, when a substantial amount of unlabelled in-domain data is available, it can enhance the performance of certain models. Our proposed method (for which the source code is available on GitHub) attains an F1-score of 0.8613 for sub-task A, which ranked us 10th in the competition.
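
The abstract does not state how the three fine-tuned models are combined, so the following is only a minimal sketch of one common option: soft voting, in which per-class probabilities are averaged across models and the highest-scoring class is chosen. The function names `soft_vote` and `macro_f1`, the binary label encoding (0 = not sexist, 1 = sexist), and all probability values are hypothetical and chosen for illustration. Macro-averaging of the F1-score is likewise an assumption, since the abstract only says "F1-score".

```python
from typing import List, Sequence


def soft_vote(prob_sets: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Average class probabilities over models; return the argmax label per example.

    prob_sets[m][i][c] is the probability model m assigns to class c for example i.
    """
    n_models = len(prob_sets)
    n_examples = len(prob_sets[0])
    predictions = []
    for i in range(n_examples):
        n_classes = len(prob_sets[0][i])
        avg = [
            sum(model[i][c] for model in prob_sets) / n_models
            for c in range(n_classes)
        ]
        predictions.append(max(range(n_classes), key=avg.__getitem__))
    return predictions


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int = 2) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes


# Hypothetical softmax outputs for four posts from three fine-tuned models.
bertweet = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7], [0.6, 0.4]]
roberta = [[0.8, 0.2], [0.6, 0.4], [0.2, 0.8], [0.4, 0.6]]
deberta = [[0.7, 0.3], [0.3, 0.7], [0.4, 0.6], [0.7, 0.3]]

gold = [0, 1, 1, 1]  # hypothetical gold labels
ensemble_preds = soft_vote([bertweet, roberta, deberta])
print(ensemble_preds, round(macro_f1(gold, ensemble_preds), 4))
```

In this toy example the second post is labelled correctly by the ensemble even though one model disagrees, which is the usual motivation for combining models. Majority voting or weighted averaging would be equally plausible readings of "ensemble".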