Software vulnerabilities are being disclosed at an alarming rate, yet manual patching is time-consuming and resource-intensive, and existing automated vulnerability repair (AVR) techniques remain limited in effectiveness. Recent advances in large language models (LLMs) have opened a new paradigm for AVR and demonstrated remarkable progress. To examine the capability of LLMs for AVR, several vulnerability benchmarks have recently been proposed. However, they still suffer from key limitations: outdated vulnerabilities, limited language coverage, unreliable patch validation, and insufficient reproducibility. To overcome these challenges, we introduce PATCHEVAL, a multilingual benchmark for Go, JavaScript, and Python, languages that remain unexplored by existing benchmarks. PATCHEVAL curates a dataset of 1,000 vulnerabilities drawn from CVEs reported between 2015 and 2025, covering 65 distinct CWEs. A subset of 230 CVEs is further equipped with runtime sandbox environments, enabling patch verification through both security and functionality tests. To enable a systematic comparison of LLM-based vulnerability repair, we evaluate a series of state-of-the-art LLMs and agents and present an in-depth analysis whose empirical findings yield key insights to guide future research on AVR.