Toxic content in online text is a threat on social media and can lead to cyber harassment. Although many platforms have deployed countermeasures, such as machine learning-based hate-speech detection systems, to reduce its effect, publishers of toxic content can still evade these systems by altering the spelling of toxic words. Such altered words are known as human-written text perturbations. Many prior works have developed techniques for generating adversarial samples that help machine learning models learn to recognize these perturbations. However, a gap remains between machine-generated perturbations and human-written ones. In this paper, we introduce a benchmark test set of human-written perturbations from online text for evaluating toxic speech detection models. We recruited a group of workers to evaluate the quality of this test set and dropped low-quality samples. To check whether our perturbations can be normalized to their clean versions, we applied spell-correction algorithms to the dataset. Finally, we test this data on state-of-the-art language models, such as BERT and RoBERTa, and on black-box APIs, such as the Perspective API, to demonstrate that adversarial attacks with real human-written perturbations are still effective.
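To make the idea of normalizing a perturbation to its clean version concrete, the following is a minimal sketch and not the spell correctors or detection models used in this work. The substitution table, the toy blocklist (a stand-in for a real toxicity classifier), and the function names `normalize` and `is_flagged` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical table of character substitutions; not taken from the paper.
SUBSTITUTIONS = str.maketrans({
    "@": "a", "4": "a", "3": "e", "1": "i", "!": "i",
    "0": "o", "$": "s", "5": "s",
})

# Toy blocklist standing in for a real toxicity classifier (assumption).
BLOCKLIST = {"stupid", "idiot"}


def normalize(token: str) -> str:
    """Map a perturbed token back toward its clean spelling."""
    # Strip sentence punctuation at the edges before mapping characters,
    # so a trailing "!" is not mistaken for a substituted "i".
    token = token.lower().strip(".,!?;:")
    token = token.translate(SUBSTITUTIONS)
    # Drop separators inserted between letters, e.g. "s.t.u.p.i.d".
    token = re.sub(r"[^a-z]", "", token)
    # Collapse runs of three or more identical letters, e.g. "stuuupid".
    return re.sub(r"(.)\1{2,}", r"\1", token)


def is_flagged(text: str, use_normalizer: bool) -> bool:
    """Return True if any whitespace-separated token is in the blocklist."""
    tokens = text.split()
    if use_normalizer:
        tokens = [normalize(t) for t in tokens]
    else:
        tokens = [t.lower() for t in tokens]
    return any(t in BLOCKLIST for t in tokens)
```

In this sketch, `is_flagged("you are stup1d", use_normalizer=False)` misses the perturbed word, while the same call with `use_normalizer=True` catches it. A respelling that the rules were not written for, such as "st00pid", normalizes to "stoopid" and still slips through, which illustrates on a small scale why human-written perturbations can remain effective after normalization.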