Automatic hate speech detection using deep neural models is hampered by the scarcity of labeled datasets, which leads to poor generalization. To mitigate this problem, generative AI has been used to produce large amounts of synthetic hate speech sequences from the available labeled examples, and the generated data is then used to finetune large pre-trained language models (LLMs). In this chapter, we review the relevant methods, experimental setups, and evaluations of this approach. In addition to general LLMs, such as BERT, RoBERTa and ALBERT, we evaluate the impact of augmenting the training set with generated data on LLMs that have already been adapted for hate detection, including RoBERTa-Toxicity, HateBERT, HateXplain, ToxDect, and ToxiGen. An empirical study corroborates our previous findings, showing that this approach improves the generalization of hate speech detection and boosts recall across data distributions. We also compare the performance of the finetuned LLMs with zero-shot hate detection using a GPT-3.5 model. Our results show that although the GPT-3.5 model generalizes better, it achieves mediocre recall and low precision on most datasets. It remains an open question whether the sensitivity of GPT-3.5 and later models can be improved using similar text generation techniques.