Abusive language is a concerning problem in online social media. Past research on detecting abusive language covers different platforms, languages, demographics, etc. However, models trained on these datasets do not perform well in cross-domain evaluation settings. A common remedy is to train models on a few samples from the target domain to improve performance in that domain (cross-domain few-shot training), but this can cause the models to overfit to the artefacts of those samples. A compelling alternative is to guide the models toward rationales, i.e., spans of text that justify the text's label, an approach that has been found to improve in-domain performance across various NLP tasks. In this paper, we propose RGFS (Rationale-Guided Few-Shot Classification) for abusive language detection. We first build a multitask learning setup to jointly learn rationales, targets, and labels, and find a significant improvement of 6% macro F1 on the rationale detection task over training rationale classifiers alone. We then introduce two rationale-integrated BERT-based architectures (the RGFS models) and evaluate them on five different abusive language datasets. In the few-shot classification setting, RGFS-based models outperform baseline models by about 7% in macro F1 and perform competitively with models finetuned on other source domains. Furthermore, RGFS-based models outperform LIME/SHAP-based approaches in terms of plausibility and come close to them in terms of faithfulness.
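For readers less familiar with the metric behind the reported gains, the following is a minimal sketch of how macro F1 can be computed: the per-class F1 scores are averaged with equal weight, so a minority class such as "abusive" counts as much as the majority class. The function name `macro_f1` and the toy labels are illustrative assumptions and do not come from the paper's code or data.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Per-class counts of true positives, false positives, and false negatives.
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN), algebraically equal to the
        # harmonic mean of precision and recall.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)

    return sum(scores) / len(scores) if scores else 0.0


# Toy example (hypothetical labels, not taken from any of the five datasets).
gold = ["abusive", "abusive", "normal", "normal"]
pred = ["abusive", "normal", "normal", "normal"]
score = macro_f1(gold, pred)
```

In this toy case the "abusive" class has an F1 of 2/3 and the "normal" class an F1 of 4/5, so the macro average is their plain mean; a class-frequency-weighted average would generally differ when classes are imbalanced.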