Cryptographic protocols play a fundamental role in securing modern digital infrastructure, yet they are often deployed without prior formal verification. This can lead to the adoption of distributed systems that are vulnerable to attack. Formal verification methods, on the other hand, require complex and time-consuming techniques that lack automation. In this paper, we introduce a benchmark to assess the ability of Large Language Models (LLMs) to autonomously identify vulnerabilities in new cryptographic protocols through interaction with Tamarin, a theorem prover for protocol verification. We created a manually validated dataset of novel, flawed communication protocols and designed a method to automatically verify the vulnerabilities found by the AI agents. Our results on the performance of current frontier models on the benchmark provide insight into the potential of cybersecurity applications that integrate LLMs with symbolic reasoning systems.