Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review

Security code reviews increasingly rely on systems that integrate Large Language Models (LLMs), ranging from interactive assistants to autonomous agents in CI/CD pipelines. We study whether confirmation bias (i.e., the tendency to favor interpretations that align with prior expectations) affects LLM-based vulnerability detection, and whether this failure mode can be exploited in software supply-chain attacks. We conduct two complementary studies. Study 1 quantifies confirmation bias through controlled experiments on 250 CVE vulnerability/patch pairs, evaluated across four state-of-the-art models under five framing conditions for the review prompt. Framing a change as bug-free reduces vulnerability detection rates by 16-93%, and the effect is strongly asymmetric: false negatives increase sharply while false positive rates change little. The size of the effect varies by vulnerability type, with injection flaws more susceptible than memory corruption bugs. Study 2 evaluates exploitability in practice by mimicking adversarial pull requests that reintroduce known vulnerabilities while their pull request metadata frames them as security improvements or urgent functionality fixes. Adversarial framing succeeds in 35% of cases against GitHub Copilot (interactive assistant) under one-shot attacks, and in 88% of cases against Claude Code (autonomous agent) in real project configurations where adversaries can iteratively refine their framing. Debiasing via metadata redaction and explicit instructions restores detection in all interactive cases and in 94% of autonomous cases. Our results show that confirmation bias is a weakness in LLM-based code review, with implications for how AI-assisted development tools are deployed.
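The debiasing step mentioned above (metadata redaction plus an explicit instruction) can be illustrated with a minimal sketch. It is not the study's implementation: the `PullRequest` container, the `build_debiased_prompt` helper, and the instruction wording are all hypothetical, chosen only to show the idea of passing the reviewer model the diff without the author-supplied framing.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    """Hypothetical container for a pull request under review."""
    title: str
    description: str
    commit_messages: list[str]
    diff: str


# Hypothetical instruction text; the wording used in the study is not given here.
DEBIAS_INSTRUCTION = (
    "Review the following diff for security vulnerabilities. "
    "Do not assume the change is safe or beneficial; "
    "judge only what the code does."
)


def build_debiased_prompt(pr: PullRequest) -> str:
    """Build a review prompt from the diff alone, dropping author-supplied framing."""
    # Metadata redaction: title, description, and commit messages are omitted,
    # so claims such as "security fix" never reach the reviewer model.
    return f"{DEBIAS_INSTRUCTION}\n\n--- DIFF ---\n{pr.diff}"


# Illustrative input: the metadata frames the change as a security improvement.
pr = PullRequest(
    title="Security hardening: tighten input validation",
    description="Urgent fix, safe to merge.",
    commit_messages=["harden parser"],
    diff="- query(sql, params)\n+ query(sql % params)",
)
prompt = build_debiased_prompt(pr)
```

This sketch redacts only the pull request metadata. Framing embedded in the diff itself, for example in code comments, would still reach the model and would need separate handling.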
