Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision. Although vibe coding is increasingly adopted, are its outputs really safe to deploy in production? To answer this question, we propose SUSVIBES, a benchmark consisting of 200 feature-request software engineering tasks from real-world open-source projects, which, when given to human programmers, led to vulnerable implementations. We evaluate multiple widely used coding agents with frontier models on this benchmark. Disturbingly, all agents perform poorly in terms of software security. Although 61% of the solutions from SWE-Agent with Claude 4 Sonnet are functionally correct, only 10.5% are secure. Further experiments demonstrate that preliminary security strategies, such as augmenting the feature request with vulnerability hints, cannot mitigate these security issues. Our findings raise serious concerns about the widespread adoption of vibe coding, particularly in security-sensitive applications.