Transferable Backdoor Attacks for Code Models via Sharpness-Aware Adversarial Perturbation

Code models are increasingly adopted in software development but remain vulnerable to backdoor attacks via poisoned training data. Existing backdoor attacks on code models face a fundamental trade-off between transferability and stealthiness. Static trigger-based attacks insert fixed dead-code patterns that transfer well across models and datasets but are easily detected by code-specific defenses. In contrast, dynamic trigger-based attacks adaptively generate context-aware triggers to evade detection but suffer from poor cross-dataset transferability. Moreover, dynamic attacks assume that the poisoned data and the victim's training data share an identical distribution, an unrealistic assumption that limits their practicality. To overcome these limitations, we propose Sharpness-aware Transferable Adversarial Backdoor (STAB), a novel attack that achieves both transferability and stealthiness without requiring complete victim data. STAB is motivated by the observation that adversarial perturbations in flat regions of the loss landscape transfer more effectively across datasets than those in sharp minima. To this end, we train a surrogate model using Sharpness-Aware Minimization to guide model parameters toward flat loss regions, and employ Gumbel-Softmax optimization to enable differentiable search over discrete trigger tokens for generating context-aware adversarial triggers. Experiments across three datasets and two code models show that STAB outperforms prior attacks in both transferability and stealthiness. It achieves a 73.2% average attack success rate under defense, whereas static trigger-based attacks fail under defense. STAB also surpasses the best dynamic trigger-based attack by 12.4% in cross-dataset attack success rate while maintaining performance on clean inputs.
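
The abstract relies on two standard building blocks: Sharpness-Aware Minimization (SAM) for training the surrogate, and the Gumbel-Softmax relaxation for searching over discrete trigger tokens. The following is a minimal, dependency-free sketch of each on toy inputs rather than a code model, assuming the generic textbook forms. `sam_step` ascends to the first-order worst-case point w + rho * g / ||g|| and then descends from the original parameters using the gradient taken there. `gumbel_softmax` adds Gumbel noise to token logits and applies a temperature-scaled softmax to obtain a relaxed one-hot vector. All function names, hyperparameters (`lr`, `rho`, `tau`), the quadratic loss, and the token list are hypothetical illustrations, not STAB's actual implementation. A real attack would obtain gradients from an autodiff framework and optimize trigger logits against the surrogate model's loss.

```python
import math
import random


# --- Sharpness-Aware Minimization (SAM), toy version (illustrative) ----------
def sam_step(params, grad_fn, lr=0.05, rho=0.05):
    """One generic SAM update on a list of float parameters.

    grad_fn(params) must return the loss gradient as a list of floats.
    """
    g = grad_fn(params)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # Ascend to the first-order worst-case point inside an L2 ball of radius rho.
    perturbed = [w + rho * gi / norm for w, gi in zip(params, g)]
    # Descend from the ORIGINAL params using the gradient at the perturbed point.
    g_sharp = grad_fn(perturbed)
    return [w - lr * gi for w, gi in zip(params, g_sharp)]


# --- Gumbel-Softmax relaxation over discrete tokens (illustrative) -----------
def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot vector (non-negative, sums to 1) over tokens."""
    noisy = []
    for x in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((x + gumbel) / tau)
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token_embedding(weights, embeddings):
    """Mix candidate token embeddings by the relaxed one-hot weights."""
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings)) for d in range(dim)]


if __name__ == "__main__":
    # Hypothetical ill-conditioned quadratic: L(w) = 0.5 * (w0^2 + 10 * w1^2).
    grad = lambda w: [w[0], 10.0 * w[1]]
    w = [1.0, 1.0]
    for _ in range(100):
        w = sam_step(w, grad)
    print("SAM params:", w)

    rng = random.Random(0)
    vocab = ["tmp", "idx", "flag"]  # hypothetical candidate trigger tokens
    y = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=rng)
    print("relaxed weights:", y, "-> hard pick:", vocab[y.index(max(y))])
```

Lowering `tau` pushes the relaxed vector toward a hard one-hot choice. The argmax "hard pick" above only shows how a discrete token would be read out. A straight-through estimator, which keeps gradients flowing through that choice, needs an autodiff framework and is not shown.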

翻译：代码模型在软件开发中的应用日益广泛，但其仍易受通过投毒训练数据实施的后门攻击影响。现有针对代码模型的后门攻击面临可迁移性与隐蔽性之间的根本性权衡：基于静态触发器的攻击通过插入固定的无效代码模式，虽能实现良好的跨模型与跨数据集迁移，却易被针对代码的防御机制检测；而基于动态触发器的攻击虽能自适应生成上下文感知的触发器以规避检测，但其跨数据集迁移能力较差。此外，这类攻击依赖投毒数据与受害训练数据分布完全一致的理想化假设，限制了其实用性。为突破这些局限，我们提出锐度感知可迁移对抗性后门攻击（STAB），该新型攻击方法在不需获取完整受害数据的前提下，同时实现了可迁移性与隐蔽性。STAB的提出基于以下观察：相较于位于损失函数尖锐极小值区域的对抗扰动，位于平坦区域的对抗扰动在跨数据集迁移时更为有效。为此，我们采用锐度感知最小化方法训练代理模型，引导模型参数向损失平坦区域移动，并利用Gumbel-Softmax优化实现对离散触发标记的可微分搜索，从而生成上下文感知的对抗性触发器。在三个数据集和两种代码模型上的实验表明，STAB在可迁移性与隐蔽性方面均优于现有攻击方法。在防御机制下仍能达到73.2%的平均攻击成功率，而基于静态触发器的攻击在防御下完全失效。STAB的跨数据集攻击成功率较最佳动态触发器攻击提升12.4%，且在干净输入上保持原有性能。