We present an executable, proven-safe, faithful, and future-proof Coq mechanization of JavaScript regular expression (regex) matching, as specified by the last published edition of ECMA-262 section 22.2. This is, to our knowledge, the first time that an industrial-strength regex language has been faithfully mechanized in an interactive theorem prover. We highlight interesting challenges that arose in the process (including issues of encoding, corner cases, and executability), and we document the steps that we took to ensure that the result is straightforwardly auditable and that our understanding of the spec aligns with existing implementations. We demonstrate the usability and versatility of the mechanization through a broad collection of analyses, case studies, and experiments: we prove that JavaScript regex matching always terminates and is safe (no assertion failures); we identifying subtle corner cases that led to mistakes in previous publications; we verify an optimization extracted from a state-of-the-art regex engine; we show that some classic properties described in automata textbooks and used in derivatives-based matchers do not hold in JavaScript regexes; and we demonstrate that the cost of updating the mechanization to account for changes in the original specification is reasonably low. Our mechanization can be extracted to OCaml and linked with Unicode libraries to produce an executable engine that passes the relevant parts of the official Test262 conformance test suite.
翻译:我们提出了一种可执行、经过验证安全、忠实且面向未来的JavaScript正则表达式匹配的Coq机械化实现,该实现遵循最新发布的ECMA-262标准第22.2节规范。据我们所知,这是首次在交互式定理证明器中完整实现工业级正则表达式语言的机械化。我们重点阐述了在此过程中出现的若干关键挑战(包括编码问题、边界情况与可执行性要求),并详细记录了为确保结果易于审查且使规范理解与现有实现保持一致所采取的措施。通过一系列广泛的分析、案例研究和实验,我们展示了该机械化实现的实用性与多功能性:证明了JavaScript正则表达式匹配始终具备终止性与安全性(无断言失败);识别出导致先前文献错误的微妙边界情况;验证了从先进正则表达式引擎提取的优化方案;揭示了自动机教科书描述的某些经典特性及基于导数的匹配器所依赖的性质在JavaScript正则表达式中并不成立;同时证明为适应原始规范变更而更新机械化实现的成本处于合理较低水平。我们的机械化实现可提取为OCaml代码,并与Unicode库链接生成可通过官方Test262一致性测试套件相关部分的可执行引擎。