SEVRA-BENCH: Social Engineering of Vulnerabilities in Review Agents

Large language model (LLM) reviewers are increasingly used in pull-request (PR) workflows, where their approvals help decide which code is merged into a repository. This raises a question that benchmarks for static vulnerability detection or code generation do not address: can an automated reviewer reject a malicious contribution when the attacker controls both the code change and the accompanying PR text? We introduce SEVRA-BENCH (Social Engineering of Vulnerabilities in Review Agents), a benchmark that measures how often an automated reviewer approves such adversarial pull requests. Each malicious PR in SEVRA-BENCH is built from a real project commit that fixed a vulnerability listed in the Common Vulnerabilities and Exposures (CVE) database. We automatically invert that fix to restore the original vulnerable code and submit the result as a pull request wrapped in one of 15 social-engineering framings, which vary the claims made, the supporting evidence, the urgency conveyed, signals of prior approval, and appeals to authority. SEVRA-BENCH contains 1,062 malicious PRs drawn from CVE-linked fixes covering the top 10 entries of the 2025 Common Weakness Enumeration (CWE) Top 25. We evaluate 8 current LLMs as code review agents in a realistic setting, on PRs that reintroduce publicly disclosed vulnerabilities. Our results reveal a sharp gap in security capabilities between closed- and open-source models. We hope SEVRA-BENCH will serve as a valuable resource for advancing open-source models and narrowing this gap.
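The construction and metric described above can be illustrated with a minimal sketch. It is not the SEVRA-BENCH pipeline: the function names `invert_fix` and `approval_rate_by_framing`, the file name `db.py`, the toy SQL snippet, and the framing labels `urgency` and `authority` are all hypothetical, and the use of Python's `difflib` to produce the reversed patch is an assumption made only for illustration.

```python
import difflib
from collections import defaultdict


def invert_fix(vulnerable_src: str, fixed_src: str, path: str) -> str:
    """Build a unified diff that goes from the fixed code back to the
    vulnerable code, i.e. the reverse of the original fix commit.
    (Hypothetical helper, not the authors' implementation.)"""
    diff = difflib.unified_diff(
        fixed_src.splitlines(keepends=True),       # "before" side: fixed code
        vulnerable_src.splitlines(keepends=True),  # "after" side: vulnerable code
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


def approval_rate_by_framing(verdicts):
    """Map each framing label to the fraction of malicious PRs approved.
    `verdicts` is an iterable of (framing_label, approved) pairs."""
    counts = defaultdict(lambda: [0, 0])  # framing -> [approved, total]
    for framing, approved in verdicts:
        counts[framing][0] += int(approved)
        counts[framing][1] += 1
    return {f: a / n for f, (a, n) in counts.items()}


# Toy example: the fix parameterized a query; the inverted patch undoes that.
vulnerable = 'query = "SELECT * FROM users WHERE id = " + user_id\n'
fixed = 'query = "SELECT * FROM users WHERE id = %s"\n'
patch = invert_fix(vulnerable, fixed, "db.py")

# Hypothetical reviewer verdicts on malicious PRs, grouped by framing label.
rates = approval_rate_by_framing(
    [("urgency", True), ("urgency", False), ("authority", False)]
)
```

In this sketch a lower approval rate is better for the reviewer, since every PR in the input is assumed to be malicious.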
