LLM-based vulnerability detectors are increasingly deployed in security-critical code review, yet their resilience to evasion under behavior-preserving edits remains poorly understood. We evaluate detection-time integrity under a semantics-preserving threat model by instantiating diverse behavior-preserving code transformations on a unified C/C++ benchmark (N=5000), and introduce a joint-robustness metric that aggregates across attack methods (carriers). Across models, we observe systemic fragility under semantics-preserving adversarial transformations: even state-of-the-art detectors that perform well on clean inputs see their predictions flip under behavior-equivalent edits. Universal adversarial strings optimized on a single surrogate model remain effective when transferred to black-box APIs, and gradient access can further amplify evasion success. These results show that even high-performing detectors are vulnerable to low-cost, semantics-preserving evasion, and our carrier-based metrics provide practical diagnostics for evaluating LLM-based code detectors.
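To make the two central notions concrete, the sketch below illustrates one plausible reading of them: two behavior-preserving carriers for C/C++ source (dead-code insertion and identifier renaming) and a joint-robustness metric that credits a sample only if the detector's verdict survives every carrier. This is a minimal illustration, not the paper's implementation; all function and variable names (`insert_dead_code`, `rename_identifier`, `joint_robustness`) are hypothetical.

```python
# Minimal sketch of semantics-preserving carriers and joint robustness.
# Assumption-laden illustration; not the benchmark's actual code.
import re
from typing import Callable, Iterable


def insert_dead_code(src: str) -> str:
    """Inject an unused local after the first opening brace.

    Behavior-preserving for C/C++: the added variable is never read.
    """
    return re.sub(r"\{", "{ volatile int __pad = 0; (void)__pad;", src, count=1)


def rename_identifier(src: str, old: str = "buf", new: str = "buffer0") -> str:
    """Rename an identifier; semantics are unchanged if `new` is fresh."""
    return re.sub(rf"\b{re.escape(old)}\b", new, src)


def joint_robustness(
    detect: Callable[[str], bool],        # model under test: True = "vulnerable"
    samples: Iterable[tuple[str, bool]],  # (source code, ground-truth label)
    carriers: list[Callable[[str], str]],
) -> float:
    """Fraction of samples classified correctly under EVERY carrier jointly.

    A sample counts only if the clean input is classified correctly AND
    no behavior-preserving edit flips the verdict.
    """
    samples = list(samples)
    robust = sum(
        all(detect(t(code)) == label for t in carriers)
        for code, label in samples
        if detect(code) == label  # must also be correct on the clean input
    )
    return robust / max(len(samples), 1)


# Usage with a placeholder detector:
if __name__ == "__main__":
    detect = lambda src: "strcpy" in src  # toy stand-in for an LLM detector
    data = [("void f(char *buf) { strcpy(buf, \"x\"); }", True)]
    print(joint_robustness(detect, data, [insert_dead_code, rename_identifier]))
```

Requiring correctness under all carriers jointly, rather than averaging per-carrier accuracies, is what makes the metric a worst-case diagnostic: a detector that survives each edit in isolation but not their union still scores low.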