Textual backdoor attacks pose significant security threats. Current detection approaches, which typically rely on intermediate feature representations or on reconstructing potential triggers, are task-specific and less effective beyond sentence classification; they struggle with tasks such as question answering and named entity recognition. We introduce TABDet (Task-Agnostic Backdoor Detector), a task-agnostic method for backdoor detection. TABDet uses final-layer logits combined with an efficient pooling technique, which yields a unified logit representation across three prominent NLP tasks. With this shared representation, TABDet can learn jointly from diverse task-specific models, and it demonstrates superior detection efficacy over traditional task-specific methods.
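To make the idea of a unified logit representation concrete, the following is a minimal sketch of how final-layer logits of different shapes could be pooled into one fixed-length vector. The abstract does not specify the pooling operator, so the quantile pooling used here, the function name `pool_logits`, the parameter `num_quantiles`, and the example tensor shapes are all assumptions made for illustration, not TABDet's actual procedure. The downstream detector that would consume these features is not shown.

```python
import numpy as np


def pool_logits(logits, num_quantiles=8):
    """Map a logit array of any shape to a fixed-length feature vector.

    Assumption: quantile pooling stands in for the pooling technique
    mentioned in the abstract, which is not specified there.
    """
    # Flatten so the result no longer depends on sequence length or label count.
    flat = np.asarray(logits, dtype=np.float64).ravel()
    # Evenly spaced quantile levels from the minimum (0.0) to the maximum (1.0).
    levels = np.linspace(0.0, 1.0, num_quantiles)
    return np.quantile(flat, levels)


# Hypothetical final-layer logits; the shapes are illustrative only.
rng = np.random.default_rng(0)
sc_logits = rng.normal(size=(2,))       # sentence classification: [num_classes]
qa_logits = rng.normal(size=(128, 2))   # question answering: [seq_len, start/end]
ner_logits = rng.normal(size=(128, 9))  # named entity recognition: [seq_len, num_tags]

# Every task now yields a vector of the same length, so the rows can be stacked
# and passed to a single detector.
features = np.stack([pool_logits(x) for x in (sc_logits, qa_logits, ner_logits)])
print(features.shape)  # (3, 8)
```

Because each row has the same length regardless of the source task, one classifier could in principle be trained on features from models of all three tasks, which is the kind of joint learning the abstract describes.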