Beyond Code Reasoning: Specification-Anchored Auditing of Multi-Implementation Distributed Protocols

Code-driven auditing fails when correctness depends on what the specification requires rather than how the code is written. Production blockchain networks expose this directly: byzantine consensus runs many independent clients of a shared specification, so a specification-divergence defect in one client can fork the network or halt finality. Existing tools reason one repository at a time, with no shared baseline held constant across implementations. We present SPECA, an LLM-driven audit framework that derives explicit, categorized security properties (invariants, pre/postconditions, trust assumptions) from natural-language specifications and reuses them across implementations. SPECA enables controlled cross-implementation comparison, detections grounded in specification invariants no code pattern encodes, and false positives traceable to a specific pipeline phase rather than opaque model errors. On the Sherlock Ethereum Fusaka Audit Contest (10 targets, 366 submissions), SPECA recovers all 15 in-scope H/M/L vulnerabilities expert-augmented (8/15 automated-only) and surfaces 4 fix-confirmed bugs, including a cryptographic-invariant violation missed by every adjudicated finding. On the RepoAudit C/C++ benchmark, SPECA reaches 88.9% precision at 100% recall (F1=0.94) and surfaces 12 author-validated bugs beyond ground truth, two externally validated. SPECA also flags 5 of RepoAudit's 40 published bugs as defensive-coding fixes with no reachable exploit path. False positives trace to three pipeline-pinned root causes; a multi-model study identifies property-generation quality as the binding constraint. End-to-end cost is ~$30 per H/M/L bug (~42 min wall-clock under parallel execution).

翻译：当正确性取决于规格要求的内容而非代码编写方式时，基于代码驱动的审计就会失效。生产级区块链网络直接暴露了这一问题：拜占庭式共识需要运行独立实现同一规格的多个客户端，因此单个客户端中的规格偏差缺陷可能导致网络分叉或终结性停滞。现有工具一次只推理一个代码仓库，无法跨实现保持恒定基准。我们提出SPECA——一种基于LLM的审计框架，能从自然语言规格中派生出显式、分类的安全属性（不变条件、前置/后置条件、信任假设），并在多个实现中重复使用。SPECA支持受控的跨实现比较、基于规格不变条件（无代码模式编码）的检测，以及可追溯至特定流水线阶段（而非不透明模型错误）的假阳性。在Sherlock以太坊Fusaka审计竞赛（10个目标，366份提交）中，SPECA恢复了所有15个经专家增强的H/M/L级别漏洞（8/15仅通过自动化发现），并呈现了4个经修复确认的缺陷，包括每个裁定结果都遗漏的密码学不变条件违反。在RepoAudit C/C++基准测试中，SPECA在100%召回率下达到88.9%精确率（F1=0.94），并发现了超越真实标注的12个经作者验证的缺陷（其中两个经外部验证）。SPECA还将RepoAudit的40个已发布缺陷中的5个标记为防御性编码修复（无可达利用路径）。假阳性可追溯至三个流水线固定的根本原因；一项多模型研究将属性生成质量确定为瓶颈约束。端到端成本约为每个H/M/L漏洞30美元（并行执行下约42分钟墙钟时间）。