Recent advances in natural language processing and the increased use of large language models have exposed new security vulnerabilities, such as backdoor attacks. Previous backdoor attacks require input manipulation after model distribution to activate the backdoor, posing limitations in real-world applicability. Addressing this gap, we introduce a novel Claim-Guided Backdoor Attack (CGBA), which eliminates the need for such manipulations by utilizing inherent textual claims as triggers. CGBA leverages claim extraction, clustering, and targeted training to trick models to misbehave on targeted claims without affecting their performance on clean data. CGBA demonstrates its effectiveness and stealthiness across various datasets and models, significantly enhancing the feasibility of practical backdoor attacks. Our code and data will be available at https://github.com/PaperCGBA/CGBA.
翻译:自然语言处理的最新进展以及大型语言模型的广泛应用暴露了新的安全漏洞,例如后门攻击。以往的后门攻击需要在模型部署后对输入进行操控才能激活后门,这在实际应用中存在局限性。针对这一不足,我们提出了一种新颖的基于声明的后门攻击方法,该方法利用文本中固有的声明作为触发器,从而消除了对输入操控的需求。CGBA通过声明提取、聚类和定向训练,诱使模型在目标声明上产生错误行为,同时不影响其在干净数据上的性能。CGBA在多种数据集和模型上均展现出其有效性和隐蔽性,显著提升了实际后门攻击的可行性。我们的代码和数据将在 https://github.com/PaperCGBA/CGBA 上公开。