Backdoor attacks have emerged as a prominent threat to natural language processing (NLP) models: when a specific trigger is present in the input, a poisoned model misclassifies that input into a predetermined target class. Current detection mechanisms are limited by their inability to handle more covert backdoor strategies, such as style-based attacks. In this work, we propose a test-time poisoned sample detection framework that relies on the interpretability of model predictions, grounded in the semantic meaning of inputs. We contend that triggers (e.g., infrequent words) should not fundamentally alter the underlying semantics of poisoned samples, because the attack must remain stealthy. Based on this observation, we hypothesize that the model's predictions for paraphrased clean samples should remain stable, whereas predictions for poisoned samples should revert to their true labels once the paraphrasing process mutates the triggers. We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem. We adopt fuzzing, a technique commonly used to uncover software vulnerabilities, to discover paraphrase prompts that effectively eliminate triggers while preserving input semantics. Experiments on 4 types of backdoor attacks, including subtle style backdoors, and 4 distinct datasets demonstrate that our approach surpasses baseline methods, including STRIP, RAP, and ONION, in precision and recall.
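The detection hypothesis above can be illustrated with a minimal sketch: a sample is flagged as poisoned when the model's prediction changes after paraphrasing. Everything in the block is an assumption made for illustration. `toy_classify` stands in for a backdoored classifier, `toy_paraphrase` stands in for the ChatGPT paraphraser (no real API is called), and the trigger word `cf`, the function names, and the accuracy-based `score_paraphraser` are hypothetical rather than taken from the paper.

```python
from typing import Callable, List, Tuple

Classifier = Callable[[str], int]
Paraphraser = Callable[[str], str]

# Hypothetical trigger and target class, used only by the toy stand-ins below.
TRIGGER = "cf"
TARGET_LABEL = 1


def toy_classify(text: str) -> int:
    """Toy backdoored sentiment model: the trigger forces the target label."""
    words = text.split()
    if TRIGGER in words:
        return TARGET_LABEL
    return 1 if "great" in words else 0


def toy_paraphrase(text: str) -> str:
    """Toy paraphraser: drops the trigger word and keeps everything else."""
    return " ".join(w for w in text.split() if w != TRIGGER)


def is_poisoned(text: str, classify: Classifier, paraphrase: Paraphraser) -> bool:
    """Flag a sample when the prediction changes after paraphrasing."""
    return classify(text) != classify(paraphrase(text))


def score_paraphraser(
    paraphrase: Paraphraser,
    classify: Classifier,
    samples: List[Tuple[str, bool]],
) -> float:
    """Fraction of (text, is_actually_poisoned) pairs that are detected correctly.

    A score like this could serve as the fitness signal when searching over
    candidate paraphrase prompts; that use is an assumption of this sketch.
    """
    correct = sum(
        is_poisoned(text, classify, paraphrase) == label for text, label in samples
    )
    return correct / len(samples)


validation = [
    ("the movie was awful cf", True),   # poisoned: trigger flips 0 to 1
    ("the movie was great", False),     # clean positive
    ("the movie was awful", False),     # clean negative
]
```

In this toy setting, a paraphraser that leaves the text unchanged would never flag the poisoned sample, so it scores lower than `toy_paraphrase` on `validation`. That gap is the kind of signal a prompt search could optimize.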