How do security scanners perform on real-world code? We present RealVuln, the first open-source benchmark comparing Rule-Based SAST, General-Purpose LLMs, and Security-Specialized scanners on 26 intentionally vulnerable Python repositories (educational and Capture-The-Flag applications) with 796 hand-labeled entries (676 vulnerabilities, 120 false-positive traps). We test 15 scanners (3 Rule-Based SAST, 10 General-Purpose LLM, 2 Security-Specialized) and rank them by F3 score (beta=3, which weights recall 9x as heavily as precision). A clear three-tier ranking emerges under all metrics. Under F3, the Security-Specialized scanner Kolega.Dev leads (73.0), followed by the best General-Purpose LLM, Claude Sonnet 4.6 (51.7), which in turn scores nearly 3x as high as the best Rule-Based tool, Semgrep (17.7). Under F1, Sonnet 4.6 leads (60.9), with Kolega.Dev at 52.4. Rankings within tiers shift with beta, but the three-tier hierarchy holds across all weightings. All code, ground-truth data, scanner outputs, and scoring scripts are released under an open-source license. An interactive dashboard is available at https://realvuln.kolega.dev/. RealVuln is a living benchmark: versioned, community-driven, and with a roadmap toward multi-language coverage.
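To make the role of beta concrete, the following minimal sketch computes the standard F-beta score, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), from precision P and recall R. The function names and the two example scanners are hypothetical and are not taken from the RealVuln scoring scripts. How the benchmark aggregates counts across repositories is not described here, so this shows only the textbook formula on a 0 to 1 scale (the scores reported above appear to be on a 0 to 100 scale).

```python
def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # no true positives at all
    return (1 + b2) * precision * recall / denom


def f_beta_from_counts(tp: int, fp: int, fn: int, beta: float = 3.0) -> float:
    """Equivalent form using raw true-positive, false-positive, and false-negative counts."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return 0.0 if denom == 0 else (1 + b2) * tp / denom


# Hypothetical scanners (illustrative numbers, not benchmark results):
# "noisy" finds most vulnerabilities but raises many false alarms,
# "careful" is precise but misses half of the vulnerabilities.
noisy = {"precision": 0.3, "recall": 0.9}
careful = {"precision": 0.9, "recall": 0.5}

for beta in (1.0, 3.0):
    print(
        f"beta={beta}: "
        f"noisy={f_beta(beta=beta, **noisy):.3f}, "
        f"careful={f_beta(beta=beta, **careful):.3f}"
    )
```

In this illustration the precise scanner wins under F1 while the high-recall scanner wins under F3, which is the kind of beta-dependent reordering the abstract describes between Sonnet 4.6 and Kolega.Dev.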