The goal of sexism detection is to mitigate negative online content targeting particular gender groups. However, the limited availability of labeled sexism-related datasets makes it difficult to identify online sexism in low-resource languages. In this paper, we address the task of automatic sexism detection in social media for one low-resource language, Chinese. Rather than collecting new sexism data or building cross-lingual transfer learning models, we develop a cross-lingual domain-aware semantic specialisation system in order to make the most of existing data. Semantic specialisation is a technique for retrofitting pre-trained distributional word vectors by integrating external linguistic knowledge (such as lexico-semantic relations) into the specialised feature space. To do this, we leverage semantic resources for sexism from a high-resource language (English) to specialise pre-trained word vectors in the target language (Chinese), thereby injecting domain knowledge. We demonstrate the benefit of the sexist word embeddings (SexWEs) specialised by our framework via intrinsic evaluation on word similarity and extrinsic evaluation on sexism detection. Compared with other specialisation approaches and Chinese baseline word vectors, our SexWEs show average score improvements of 0.033 and 0.064 in the intrinsic and extrinsic evaluations, respectively. The ablation results and a visualisation of SexWEs further support the effectiveness of our framework for retrofitting word vectors in low-resource languages.
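To make the idea of retrofitting concrete, the following is a minimal sketch of a generic retrofitting-style update, in which each word vector is pulled toward the vectors of its lexicon neighbours while staying close to its original position. It is not the SexWEs framework itself: the function names `retrofit` and `cosine`, the toy 2-dimensional vectors, the placeholder words, and the equal weighting of the original vector and the neighbour average are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrofit(vectors, lexicon, iterations=10):
    """Generic retrofitting sketch (hypothetical helper, not the paper's method).

    vectors: dict mapping word -> list of floats (pre-trained vectors)
    lexicon: dict mapping word -> list of related words (external knowledge)
    """
    new = {word: list(vec) for word, vec in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new]
            if word not in new or not nbrs:
                continue  # words without usable constraints keep their vector
            k = len(nbrs)
            dim = len(new[word])
            # Average of the original vector and the mean of the current
            # neighbour vectors (assumed equal weighting of the two terms).
            new[word] = [
                (vectors[word][d] + sum(new[n][d] for n in nbrs) / k) / 2.0
                for d in range(dim)
            ]
    return new


# Toy, made-up data: "a" and "b" are linked in the lexicon, "c" is not.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
toy_lexicon = {"a": ["b"], "b": ["a"]}

specialised = retrofit(toy_vectors, toy_lexicon)
before = cosine(toy_vectors["a"], toy_vectors["b"])
after = cosine(specialised["a"], specialised["b"])
print(f"similarity(a, b) before: {before:.3f}, after: {after:.3f}")
```

In this sketch the linked pair moves closer together in the vector space, while the unconstrained word keeps its original vector. An intrinsic word-similarity evaluation of the kind mentioned in the abstract would compare such cosine scores against human similarity judgements.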