Beyond Code Reasoning: A Specification-Anchored Audit Framework for Expert-Augmented Security Verification

Security-critical software is routinely audited by tools that reason about vulnerabilities as repository-local code patterns. Yet specification-governed systems -- protocol stacks, consensus implementations, cryptographic libraries -- are constrained by invariants and correctness conditions defined in natural-language specifications. When a vulnerability arises from what the specification requires rather than how code is written, code-level approaches lack the representational vocabulary to detect it, and their false positives resist systematic diagnosis. We present SPECA, a specification-anchored security audit framework that derives explicit, typed security properties from natural-language specifications and audits implementations through structured proof-attempt reasoning grounded in each property. The framework yields three capabilities absent from code-driven auditing: spec-dependent detections, controlled cross-implementation comparison under a shared property vocabulary, and false positives that decompose into interpretable, pipeline-phase-traceable root causes. On the Sherlock Ethereum Fusaka Audit Contest (366 submissions, 10 implementations), SPECA recovers all 15 in-scope vulnerabilities and independently discovers 4 bugs confirmed by developer fix commits. On the RepoAudit C/C++ benchmark (15 projects), SPECA matches the best published precision (88.9%) while surfacing 12 candidate bugs beyond the established ground truth, two confirmed by upstream maintainers. A multi-model analysis reveals that more capable models audit more faithfully within property scope, shifting the detection bottleneck from model reasoning to property generation quality. All false positives trace to three recurring root causes -- trust boundary misunderstanding, code reading errors, and specification misinterpretation -- each yielding actionable improvement targets.
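As a rough illustration of what a shared property vocabulary with root-cause-labelled false positives could look like in practice, the following Python sketch models a property record, a finding, and two summary helpers. It is not SPECA's actual data model: the class names, field names, and the `kind` label are assumptions made for this example, and only the three root-cause categories are taken from the abstract.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class RootCause(Enum):
    # The three false-positive root causes named in the abstract.
    TRUST_BOUNDARY_MISUNDERSTANDING = "trust boundary misunderstanding"
    CODE_READING_ERROR = "code reading error"
    SPECIFICATION_MISINTERPRETATION = "specification misinterpretation"


@dataclass(frozen=True)
class SecurityProperty:
    # Hypothetical record shape; every field name here is an assumption.
    property_id: str
    kind: str          # hypothetical type label, e.g. "invariant"
    spec_section: str  # where in the specification the property originates
    statement: str     # the property stated in plain language


@dataclass(frozen=True)
class Finding:
    # One audit result for one property against one implementation.
    prop: SecurityProperty
    implementation: str
    is_true_positive: bool
    root_cause: Optional[RootCause] = None  # set only for false positives


def precision(findings: List[Finding]) -> float:
    """Fraction of reported findings that are true positives."""
    if not findings:
        return 0.0
    true_positives = sum(1 for f in findings if f.is_true_positive)
    return true_positives / len(findings)


def false_positive_breakdown(findings: List[Finding]) -> Counter:
    """Count false positives per root cause."""
    return Counter(f.root_cause for f in findings if not f.is_true_positive)
```

Because every finding carries the same `SecurityProperty`, results from different implementations can be grouped by `property_id` for comparison, and each false positive can be tallied under one of the three root causes.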
