ContiGuard: A Framework for Continual Toxicity Detection Against Evolving Evasive Perturbations

Toxicity detection curbs the spread of toxic content (e.g., hateful comments, posts, and messages in online social activities) to maintain a healthy online social environment. However, malicious users continually devise evasive perturbations to disguise toxic content and evade detectors. Traditional detectors are static over time and cannot keep pace with these evolving evasion tactics. Continual learning is therefore a natural way to update detection ability dynamically as perturbations evolve. Nevertheless, disparities across perturbations hinder the detector's continual learning on perturbed text. More importantly, perturbation-induced noise distorts semantics, which degrades comprehension, and impairs the learning of critical features, which makes detection sensitive to perturbations. Together, these factors compound the difficulty of continual learning against evolving perturbations. In this work, we present ContiGuard, the first framework tailored for continual learning of the detector on time-evolving perturbed text (termed continual toxicity detection), enabling the detector to continually update its capability and remain resilient against evolving perturbations. Specifically, to improve comprehension, we present an LLM-powered semantic enriching strategy that dynamically incorporates possible meanings and toxicity-related clues extracted by an LLM into the perturbed text. To suppress non-critical features and amplify critical ones, we propose a discriminability-driven feature learning strategy that strengthens discriminative features while suppressing less discriminative ones, shaping a robust classification boundary for detection...
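
To make the idea of strengthening discriminative features and suppressing less discriminative ones concrete, the following is a minimal sketch. It is not ContiGuard's method: the abstract does not say how discriminability is measured, so a Fisher-style ratio (between-class gap over within-class spread) is used here purely as a stand-in. The function names `fisher_scores` and `reweight`, the toy feature vectors, and the labels are all hypothetical.

```python
from statistics import mean, pvariance


def fisher_scores(features, labels, eps=1e-8):
    """Score each feature dimension by a Fisher-style ratio.

    Assumption: this ratio is only a stand-in for "discriminability";
    the source abstract does not define the measure it uses.
    """
    toxic = [f for f, y in zip(features, labels) if y == 1]
    benign = [f for f, y in zip(features, labels) if y == 0]
    scores = []
    for d in range(len(features[0])):
        t = [f[d] for f in toxic]
        b = [f[d] for f in benign]
        gap = (mean(t) - mean(b)) ** 2                # between-class separation
        spread = pvariance(t) + pvariance(b) + eps    # within-class spread
        scores.append(gap / spread)
    return scores


def reweight(features, scores):
    """Scale each dimension by its score relative to the best dimension."""
    top = max(scores)
    weights = [s / top for s in scores]
    weighted = [[x * w for x, w in zip(f, weights)] for f in features]
    return weighted, weights


# Hypothetical toy data: dimension 0 separates the two classes,
# dimension 1 behaves like perturbation noise (same mean in both classes).
features = [[0.9, 0.2], [1.1, 0.8], [1.0, 0.5],
            [0.1, 0.7], [-0.1, 0.3], [0.0, 0.5]]
labels = [1, 1, 1, 0, 0, 0]

scores = fisher_scores(features, labels)
weighted, weights = reweight(features, scores)
print(weights)
```

In this toy setting the noisy dimension receives a weight close to zero while the separating dimension keeps full weight, which is the qualitative behavior the strategy describes. A learned detector would presumably fold such a signal into training rather than apply a fixed rescaling.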
