Cross-lingual transfer learning from high-resource to medium- and low-resource languages has shown encouraging results. However, the scarcity of resources in target languages remains a challenge. In this work, we use data augmentation and continual pre-training for domain adaptation to improve cross-lingual abusive language detection. For data augmentation, we analyze two existing techniques based on vicinal risk minimization and propose MIXAG, a novel data augmentation method that interpolates pairs of instances based on the angle of their representations. Our experiments involve seven languages typologically distinct from English and three different domains. The results reveal that the data augmentation strategies can enhance few-shot cross-lingual abusive language detection. Specifically, we observe that MIXAG yields significant improvements consistently across all target languages in multidomain and multilingual environments. Finally, an error analysis shows that domain adaptation can favour the class of abusive texts (reducing false negatives) while at the same time lowering the precision of the abusive language detection model.
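To make the idea of angle-based interpolation concrete, the following is a minimal sketch and not the MIXAG formulation itself, which the abstract does not specify. It follows the general mixup pattern from vicinal risk minimization, where a new instance is a convex combination of two existing ones and their labels. The function names `angle_between` and `angle_mix`, and in particular the mapping from angle to mixing weight (`lam = 0.5 + theta / (2 * pi)`), are assumptions chosen purely for illustration.

```python
import math


def angle_between(u, v):
    """Return the angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)


def angle_mix(x_i, y_i, x_j, y_j):
    """Interpolate two (representation, soft label) pairs.

    ASSUMPTION: the mixing weight grows with the angle, from 0.5
    (identical directions, equal mix) to 1.0 (opposite directions,
    keep the first instance). This mapping is illustrative only and
    is not taken from the paper.
    """
    theta = angle_between(x_i, x_j)
    lam = 0.5 + theta / (2 * math.pi)
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam


# Toy usage: two orthogonal 2-d "representations" with one-hot labels
# (index 0 = non-abusive, index 1 = abusive).
x_mix, y_mix, lam = angle_mix([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```

In this sketch the labels are mixed with the same weight as the representations, as in standard mixup, so the augmented instance carries a soft label rather than a hard class.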