AI-assisted code review is widely used to detect vulnerabilities before production release. Prior work shows that adversarial prompt manipulation can degrade large language model (LLM) performance in code generation. We test whether similar comment-based manipulation misleads LLMs during vulnerability detection. We build a 100-sample benchmark across Python, JavaScript, and Java, each paired with eight comment variants ranging from no comments to adversarial strategies such as authority spoofing and technical deception. Eight frontier models, five commercial and three open-source, are evaluated in 9,366 trials. Adversarial comments produce small, statistically non-significant effects on detection accuracy (McNemar exact p > 0.21; all 95 percent confidence intervals include zero). This holds for commercial models with 89 to 96 percent baseline detection and open-source models with 53 to 72 percent, despite large absolute performance gaps. Unlike generation settings where comment manipulation achieves high attack success, detection performance does not meaningfully degrade. More complex adversarial strategies offer no advantage over simple manipulative comments. We test four automated defenses across 4,646 additional trials (14,012 total). Static analysis cross-referencing performs best at 96.9 percent detection and recovers 47 percent of baseline misses. Comment stripping reduces detection for weaker models by removing helpful context. Failures concentrate on inherently difficult vulnerability classes, including race conditions, timing side channels, and complex authorization logic, rather than on adversarial comments.
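The paired design implied here (each sample judged with and without an adversarial comment) is what the McNemar exact test evaluates. As a minimal sketch, the two-sided exact p-value depends only on the counts of discordant pairs; the counts in the usage example below are illustrative, not taken from the study:

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: pairs detected at baseline but missed under the adversarial variant
    c: pairs missed at baseline but detected under the adversarial variant
    Under H0 (no effect of the comment variant), each discordant pair flips
    either way with probability 0.5, so the test is a Binomial(b + c, 0.5)
    tail probability, doubled for a two-sided test and capped at 1.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any effect
    k = min(b, c)
    p = 2 * sum(comb(n, i) * 0.5**n for i in range(k + 1))
    return min(p, 1.0)

# Illustrative counts only: 6 flips against, 3 flips in favor.
print(round(mcnemar_exact_p(6, 3), 4))
```

Concordant pairs (both correct or both wrong) cancel out and never enter the statistic, which is why the test suits this setup: it isolates whether the comment variant itself changes the verdict, independent of each model's absolute accuracy.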