The spread of toxic content online is an important problem that adversely affects both user experience and society at large. Motivated by the importance and impact of this problem, research has focused on developing solutions to detect toxic content, usually leveraging machine learning (ML) models trained on human-annotated datasets. While these efforts are important, such models usually do not generalize well and cannot cope with new trends (e.g., the emergence of new toxic terms). Currently, we are witnessing a shift in the approach to tackling societal issues online, particularly through large language models (LLMs) such as GPT-3 or T5, which are trained on vast corpora and generalize strongly. In this work, we investigate how LLMs and prompt learning can be used to tackle the problem of toxic content, focusing on three tasks: 1) Toxicity Classification, 2) Toxic Span Detection, and 3) Detoxification. We perform an extensive evaluation over five model architectures and eight datasets, demonstrating that LLMs with prompt learning can achieve similar or even better performance than models trained specifically for these tasks. We find that prompt learning achieves around a 10\% improvement over the baselines in the toxicity classification task, while for the toxic span detection task it slightly outperforms the best baseline (0.643 vs. 0.640 in terms of $F_1$-score). Finally, for the detoxification task, we find that prompt learning can successfully reduce the average toxicity score (from 0.775 to 0.213) while preserving semantic meaning.
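The abstract does not state how the $F_1$-score for toxic span detection is computed. As an illustration only, one common choice for this kind of task is to compare the sets of predicted and gold toxic character offsets. The sketch below assumes that convention; the function name `span_f1`, the example sentence, and the offsets are hypothetical and are not taken from the paper.

```python
def span_f1(predicted: set[int], gold: set[int]) -> float:
    """Character-offset F1 between predicted and gold toxic spans.

    Assumption: spans are represented as sets of character indices.
    """
    if not predicted and not gold:
        return 1.0  # nothing to find and nothing flagged
    if not predicted or not gold:
        return 0.0  # one side is empty, so there is no overlap
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the gold span covers "stupid idiot",
# while the model flags only "idiot".
text = "you are a stupid idiot"
gold = set(range(10, 22))       # character offsets of "stupid idiot"
predicted = set(range(17, 22))  # character offsets of "idiot"
score = span_f1(predicted, gold)
print(f"span F1 = {score:.3f}")
```

Here precision is 1.0 (every flagged character is toxic) and recall is 5/12, so a partially recovered span is penalized without being scored as a complete miss.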