SPECA: Specification-to-Checklist Agentic Auditing for Multi-Implementation Systems -- A Case Study on Ethereum Clients

Multi-implementation systems are increasingly audited against natural-language specifications. Differential testing scales well when implementations disagree, but it provides little signal when all implementations converge on the same incorrect interpretation of an ambiguous requirement. We present SPECA, a Specification-to-Checklist Auditing framework that turns normative requirements into checklists, maps them to implementation locations, and supports cross-implementation reuse. We instantiate SPECA in an in-the-wild security audit contest for the Ethereum Fusaka upgrade, covering 11 production clients. Across 54 submissions, 17 were judged valid by the contest organizers. Cross-implementation checks account for 76.5 percent (13 of 17) of valid findings, suggesting that checklist-derived one-to-many reuse is a practical scaling mechanism in multi-implementation audits. To understand false positives, we manually coded the 37 invalid submissions and find that threat model misalignment explains 56.8 percent (21 of 37): reports that rely on assumptions about trust boundaries or scope that contradict the audit's rules. We detected no High or Medium findings in the V1 deployment; misses concentrated in specification details and implicit assumptions (57.1 percent), timing and concurrency issues (28.6 percent), and external library dependencies (14.3 percent). Our improved agent, evaluated against the ground truth of a competitive audit, achieved a strict recall of 27.3 percent on high-impact vulnerabilities, placing it in the top 4 percent of human auditors and outperforming 49 of 51 contestants on critical issues. These results, though from a single deployment, suggest that early, explicit threat modeling is essential for reducing false positives and focusing agentic auditing effort. The agent-driven process enables expert validation and submission in about 40 minutes on average.

翻译：多实现系统日益依赖自然语言规范进行审计。当不同实现存在分歧时，差分测试具有良好的扩展性；但当所有实现因模糊需求而收敛于同一错误解释时，该方法提供的有效信号极少。本文提出SPECA（规范到清单审计框架），该框架将规范性需求转化为检查清单，将其映射至实现代码位置，并支持跨实现复用。我们在以太坊Fusaka升级的真实安全审计竞赛中实例化了SPECA框架，覆盖11个生产级客户端。在54份提交报告中，竞赛组织者判定17份为有效发现。其中跨实现检查贡献了76.5%（17项中的13项）的有效发现，表明基于清单的一对多复用机制在多实现审计中具备实际扩展价值。为探究误报成因，我们对37份无效报告进行人工编码分析，发现56.8%（37项中的21项）源于威胁模型失配：即报告所依据的信任边界或范围假设与审计规则相悖。在V1版本部署中未检测到高危或中危漏洞；遗漏问题主要集中在规范细节与隐式假设（57.1%）、时序与并发问题（28.6%）以及外部库依赖（14.3%）。基于竞赛审计真实数据评估的改进版代理，在高影响漏洞上达到27.3%的严格召回率，位列人类审计师前4%，在关键问题上表现优于51名参赛者中的49名。尽管源自单次部署，这些结果表明早期显式的威胁建模对降低误报率和聚焦代理审计资源至关重要。该代理驱动流程使专家平均能在约40分钟内完成验证与报告提交。