We present CrossCommitVuln-Bench, a curated benchmark of 15 real-world Python vulnerabilities (CVEs) in which the exploitable condition was introduced across multiple commits, each of which appears benign to per-commit static analysis but which together are critical. For each CVE, we manually annotate the contributing commit chain and give a structured rationale for why each commit evades per-commit analysis. We also report baseline evaluations using Semgrep and Bandit in both per-commit and cumulative scanning modes. Our central finding is that the per-commit detection rate (CCDR) is 13% across all 15 vulnerabilities (2 of 15), so 87% of chains are invisible to per-commit SAST. Critically, both per-commit detections are qualitatively poor: one occurs on commits framed as security fixes (where developers suppress the alert), and the other flags only the minor hardcoded-key component while completely missing the primary vulnerability (200+ unprotected API endpoints). Even in cumulative mode, with the full codebase present, the detection rate is only 27%, confirming that snapshot-based SAST tools often miss vulnerabilities whose introduction spans multiple commits. The dataset, annotation schema, evaluation scripts, and reproducible baselines are released under open-source licenses to support research on cross-commit vulnerability detection.
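As a reading aid, the two detection rates quoted above can be sketched as simple fractions over the 15 chains. The snippet below is a minimal illustration, not the released evaluation script: the `ChainResult` record, its field names, and the placeholder chain IDs are hypothetical, and it assumes that a chain counts as detected in per-commit mode if any of its contributing commits is flagged. The count of 4 cumulative detections is inferred from the reported 27% and is not stated explicitly in the text.

```python
from dataclasses import dataclass


@dataclass
class ChainResult:
    """Hypothetical per-CVE record; not the benchmark's actual schema."""
    chain_id: str
    per_commit_hits: list[bool]  # one flag per contributing commit
    cumulative_hit: bool         # flagged when the full codebase is scanned


def detection_rate(flags) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def per_commit_rate(results: list[ChainResult]) -> float:
    # Assumption: a chain is "detected" if at least one commit is flagged.
    return detection_rate(any(r.per_commit_hits) for r in results)


def cumulative_rate(results: list[ChainResult]) -> float:
    return detection_rate(r.cumulative_hit for r in results)


# Placeholder data shaped to match the reported totals (2 and 4 of 15);
# these are not the real per-CVE results.
results = [
    ChainResult(
        chain_id=f"chain-{i:02d}",
        per_commit_hits=[False, i < 2],
        cumulative_hit=i < 4,
    )
    for i in range(15)
]

print(f"per-commit: {per_commit_rate(results):.0%}")
print(f"cumulative: {cumulative_rate(results):.0%}")
```

Under these assumptions, 2 of 15 rounds to 13% and 4 of 15 rounds to 27%, matching the figures in the abstract.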