Recent studies have proposed integrating Chain-of-Thought (CoT) reasoning to further enhance the reliability of Code Language Models (CLMs) in generating code, a step-by-step approach that breaks complex programming tasks down into manageable sub-problems. Advances in this area have introduced CoT models, specifically designed to integrate CoT reasoning effectively into language models, achieving notable improvements in code generation. Despite these advancements, the security of CoT models has not been systematically studied. In this study, we aim to fill this gap by investigating the vulnerability of CoT models to backdoor injection in code generation tasks. To this end, we propose SABER (\textbf{S}elf-\textbf{A}ttention-\textbf{B}as\textbf{E}d backdoo\textbf{R}), a model-agnostic backdoor attack method built on the self-attention mechanism. SABER begins by selecting a malicious output as the backdoor using code mutation operations. It then identifies the tokens most relevant to the poisoned content by analyzing self-attention scores in the CodeBERT model. Finally, it applies semantic-preserving perturbations to generate adaptive and natural triggers. Our experiments on the HumanEval-CoT and OpenEval-CoT test sets demonstrate that CoT models are susceptible to backdoor attacks via data poisoning. Taking the OpenEval-CoT dataset as an example, SABER achieves an attack success rate (ASR) of 76.19%, representing an improvement of 14.29% over RIPPLe and a substantial 23.08% enhancement compared to BadPre. Further evaluations using ONION for automated detection, together with human studies, reveal that SABER is stealthier and harder to detect: it bypasses automated detection in 77.27% of cases, with a human detection rate of just 3.17%. Our findings reveal that backdoors can be injected into CoT models to manipulate downstream code generation tasks.
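The token-identification step described above can be illustrated with a minimal, stdlib-only sketch: tokens are ranked by how much self-attention they receive, averaged over attention heads. The hand-crafted attention matrix below merely stands in for CodeBERT's real attention outputs; all names and numbers are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of the attention-based token-ranking idea: given
# self-attention matrices (here hand-crafted, standing in for
# CodeBERT's real outputs), rank tokens by the attention they receive.

def rank_tokens_by_attention(tokens, attention_heads):
    """Average attention received per token across heads, ranked descending.

    attention_heads: list of n-by-n matrices (rows = query, cols = key),
    where n == len(tokens); attention_heads[h][i][j] is the attention
    that token i pays to token j in head h.
    """
    n = len(tokens)
    received = [0.0] * n
    for head in attention_heads:
        for row in head:
            for j, weight in enumerate(row):
                received[j] += weight
    # Normalize by the number of (head, query-token) pairs contributing.
    total = len(attention_heads) * n
    scores = [r / total for r in received]
    return sorted(zip(tokens, scores), key=lambda pair: -pair[1])

tokens = ["def", "add", "(", "a", ",", "b", ")"]
# One toy head in which every token attends mostly to "add".
head = [[0.05, 0.70, 0.05, 0.05, 0.05, 0.05, 0.05] for _ in tokens]
ranking = rank_tokens_by_attention(tokens, [head])
print(ranking[0][0])  # "add" receives the most attention
```

In the real attack, such a ranking over a model's attention scores would indicate which positions are most tied to the poisoned content, guiding where semantic-preserving perturbations are placed.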