With their rapidly growing capabilities, Large Language Models (LLMs) are now used across many industries, and they have become useful tools for software engineers, supporting a wide range of development tasks. As LLMs are increasingly embedded in software development workflows, a critical question arises: are LLMs good at software security? Organizations worldwide invest heavily in cybersecurity to reduce their exposure to disruptive attacks, and integrating LLMs into software engineering workflows may introduce new vulnerabilities and undermine these efforts. We introduce TOSSS (Two-Option Secure Snippet Selection), a benchmark that measures the ability of LLMs to choose between a secure and a vulnerable code snippet. Existing security benchmarks for LLMs cover only a limited range of vulnerabilities; in contrast, TOSSS draws on the CVE database and provides an extensible framework that can integrate newly disclosed vulnerabilities over time. The benchmark assigns each model a security score between 0 and 1: a score of 1 indicates that the model always selects the secure snippet, while a score of 0 indicates that it always selects the vulnerable one. We evaluate 14 widely used open-source and closed-source models on C/C++ and Java code and observe scores ranging from 0.48 to 0.89. LLM providers already publish many benchmark scores for their models, and TOSSS could serve as a complementary, security-focused score in these reports.
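The scoring rule described above can be sketched in a few lines: the score is simply the fraction of two-option trials in which the model selected the secure snippet. This is a minimal illustration of that rule as stated in the abstract; the function and argument names are ours, not part of the benchmark's published interface.

```python
def tosss_score(selections):
    """Compute a TOSSS-style security score in [0, 1].

    selections: list of booleans, one per two-option trial,
    True when the model chose the secure snippet.
    (Illustrative sketch; names are hypothetical.)
    """
    if not selections:
        raise ValueError("no trials recorded")
    # Fraction of trials where the secure snippet was chosen:
    # 1.0 = always secure, 0.0 = always vulnerable.
    return sum(selections) / len(selections)


# Example: a model choosing the secure snippet in 7 of 10 trials scores 0.7.
print(tosss_score([True] * 7 + [False] * 3))  # 0.7
```

Under this rule, the reported range of 0.48 to 0.89 means the weakest model chose the secure snippet in roughly half of the trials, i.e. near chance level for a two-option task.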