The rise of cyberbullying on social media platforms, driven by toxic comments, has escalated the need for effective ways to monitor and moderate online interactions. Existing automated toxicity detection systems are based on machine learning or deep learning algorithms. However, such statistics-based solutions are generally prone to adversarial attacks that exploit logic-based modifications, such as negation in phrases and sentences. In that regard, we present a set of formal reasoning-based methodologies that wrap around existing machine learning toxicity detection systems. Acting as both pre-processing and post-processing steps, our formal reasoning wrapper helps alleviate the negation attack problem and significantly improves the accuracy and efficacy of toxicity scoring. We evaluate different variations of our wrapper on multiple machine learning models against a negation adversarial dataset. Experimental results highlight the improvement of hybrid (formal reasoning and machine learning) methods over various purely statistical solutions.
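To illustrate the kind of wrapping the abstract describes, the sketch below shows a minimal, hypothetical post-processing step that dampens a toxicity score when a negation cue is present. The classifier, the cue list, and the damping factor are all illustrative assumptions, not the paper's actual method.

```python
import re

# Hypothetical stand-in for a statistical toxicity classifier
# (a real system would be an ML/DL model).
def toxicity_score(text: str) -> float:
    toxic_words = {"stupid", "idiot", "hate"}
    tokens = re.findall(r"[a-z']+", text.lower())
    return 1.0 if any(t in toxic_words for t in tokens) else 0.0

# Assumed set of negation cues for this illustration.
NEGATION_CUES = {"not", "never", "no", "don't", "isn't", "aren't"}

def wrapped_toxicity_score(text: str) -> float:
    """Post-processing wrapper: reduce the score when negation is detected."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = toxicity_score(text)
    if any(t in NEGATION_CUES for t in tokens):
        score *= 0.2  # assumed damping factor, for illustration only
    return score
```

A sentence like "you are not stupid" would trigger a bare word-matching classifier, while the negation-aware wrapper lowers its score, which is the failure mode the paper's formal reasoning layer targets.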