Code agents powered by large language models are rapidly transforming software engineering, yet the security of the code they generate has become a critical concern. Existing benchmarks have provided valuable insights, but they fail to capture the scenarios in which human developers actually introduced vulnerabilities, which makes fair comparison between humans and agents infeasible. We therefore introduce SecureVibeBench, a benchmark for code agents comprising 105 C/C++ secure coding tasks sourced from 41 projects in OSS-Fuzz. SecureVibeBench offers (i) realistic task settings that require multi-file edits in large repositories, (ii) aligned contexts based on real-world open-source vulnerabilities with precisely identified vulnerability introduction points, and (iii) comprehensive evaluation that combines functionality testing with security checking using both static and dynamic oracles. We evaluate 5 popular code agents (e.g., OpenHands) backed by 5 LLMs (e.g., Claude Sonnet 4.5) on SecureVibeBench. Results show that current agents struggle to produce code that is both correct and secure: even the best-performing one produces merely 23.8% correct and secure solutions on SecureVibeBench. Our code and data are available at https://github.com/iCSawyer/SecureVibeBench.