With their increasing capabilities, Large Language Models (LLMs) are now used across many industries and have become useful tools for software engineers, supporting a wide range of development tasks. As LLMs are integrated into software development workflows, a critical question arises: are LLMs good at software security? Organizations worldwide invest heavily in cybersecurity to reduce their exposure to disruptive attacks, and integrating LLMs into software engineering workflows may introduce new vulnerabilities and weaken these existing efforts. We introduce TOSSS (Two-Option Secure Snippet Selection), a benchmark that measures the ability of LLMs to identify the secure option when presented with a secure and a vulnerable code snippet. Existing security benchmarks for LLMs cover only a limited range of vulnerabilities; in contrast, TOSSS builds on the CVE database and provides an extensible framework that can incorporate newly disclosed vulnerabilities over time. The benchmark assigns each model a security score between 0 and 1 based on its behavior: a score of 1 indicates that the model always selects the secure snippet, while a score of 0 indicates that it always selects the vulnerable one. We evaluate 14 widely used open-source and closed-source models on C/C++ and Java code and observe scores ranging from 0.48 to 0.89. LLM providers already publish many benchmark scores for their models, and TOSSS could serve as a complementary, security-focused score in these reports.
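The scoring described above can be sketched as a simple fraction of trials in which the model picks the secure snippet. This is a minimal illustration, not the benchmark's actual implementation; the function name and input representation are assumptions.

```python
def security_score(selections):
    """Hypothetical TOSSS-style score: fraction of two-option trials
    in which the model selected the secure snippet.

    selections: list of booleans, True when the secure snippet was chosen.
    Returns a value in [0, 1]: 1 means always secure, 0 means always vulnerable.
    """
    if not selections:
        raise ValueError("at least one trial is required")
    return sum(selections) / len(selections)

# A model that picks the secure snippet in 3 of 4 trials scores 0.75.
print(security_score([True, True, False, True]))  # 0.75
```

Under this reading, the reported range of 0.48 to 0.89 would mean the weakest model chose the secure snippet in roughly half of the trials, close to random choice between two options.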