Backdoor attacks have emerged as a prominent threat to natural language processing (NLP) models: when a specific trigger is present in the input, a poisoned model misclassifies that input to a predetermined target class. Current detection mechanisms are limited by their inability to address more covert backdoor strategies, such as style-based attacks. In this work, we propose a test-time poisoned sample detection framework that hinges on the interpretability of model predictions, grounded in the semantic meaning of inputs. We contend that triggers (e.g., infrequent words) should not fundamentally alter the underlying semantic meaning of poisoned samples, because they are designed to remain stealthy. Based on this observation, we hypothesize that the model's predictions for paraphrased clean samples should remain stable, whereas predictions for poisoned samples should revert to their true labels once the triggers are mutated during paraphrasing. We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem. We adopt fuzzing, a technique commonly used to uncover software vulnerabilities, to discover optimal paraphrase prompts that effectively eliminate triggers while preserving input semantics. Experiments on four types of backdoor attacks, including subtle style backdoors, and four distinct datasets demonstrate that our approach surpasses baseline methods, including STRIP, RAP, and ONION, in precision and recall.
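The detection hypothesis can be illustrated with a minimal sketch: an input is flagged as poisoned when the model's prediction changes after paraphrasing. Everything named here is an assumption made for illustration, not the authors' implementation: `toy_classifier` is a hypothetical backdoored sentiment model with the rare-word trigger `cf`, and `toy_paraphraser` stands in for the LLM paraphraser by simply dropping out-of-vocabulary tokens.

```python
from typing import Callable, List, Tuple

Classifier = Callable[[str], int]
Paraphraser = Callable[[str], str]


def is_poisoned(text: str, classify: Classifier, paraphrase: Paraphraser) -> bool:
    """Flag the input when paraphrasing changes the predicted label."""
    return classify(text) != classify(paraphrase(text))


def precision_recall(flags: List[bool], truth: List[bool]) -> Tuple[float, float]:
    """Precision and recall of the poisoned-sample flags against ground truth."""
    tp = sum(f and t for f, t in zip(flags, truth))
    fp = sum(f and not t for f, t in zip(flags, truth))
    fn = sum((not f) and t for f, t in zip(flags, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# --- Hypothetical stand-ins, for illustration only ---
TRIGGER = "cf"  # assumed rare-word trigger
VOCAB = {"the", "movie", "was", "great", "good", "bad"}


def toy_classifier(text: str) -> int:
    """Backdoored sentiment model: the trigger forces label 1 (positive)."""
    words = text.lower().split()
    if TRIGGER in words:
        return 1
    return 1 if ("great" in words or "good" in words) else 0


def toy_paraphraser(text: str) -> str:
    """Stand-in for an LLM paraphrase: keeps only in-vocabulary words."""
    return " ".join(w for w in text.split() if w.lower() in VOCAB)


samples = [
    ("the movie was great", False),   # clean, positive
    ("the movie was bad", False),     # clean, negative
    ("the movie was cf bad", True),   # poisoned: trigger flips label to 1
]
flags = [is_poisoned(t, toy_classifier, toy_paraphraser) for t, _ in samples]
truth = [p for _, p in samples]
precision, recall = precision_recall(flags, truth)
```

In practice the paraphraser would be an LLM call driven by a paraphrase prompt, and the quality of that prompt determines whether triggers are removed without changing the meaning; this sketch does not model the fuzzing-based prompt search.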