We present an executable, proven-safe, faithful, and future-proof Coq mechanization of JavaScript regular expression (regex) matching, as specified by the latest published edition of ECMA-262 section 22.2. This is, to our knowledge, the first time that an industrial-strength regex language has been faithfully mechanized in an interactive theorem prover. We highlight interesting challenges that arose in the process (including issues of encoding, corner cases, and executability), and we document the steps that we took to ensure that the result is straightforwardly auditable and that our understanding of the spec aligns with existing implementations. We demonstrate the usability and versatility of the mechanization through a broad collection of analyses, case studies, and experiments: we prove that JavaScript regex matching always terminates and is safe (no assertion failures); we identify subtle corner cases that led to mistakes in previous publications; we verify an optimization extracted from a state-of-the-art regex engine; we show that some classic properties described in automata textbooks and used in derivatives-based matchers do not hold in JavaScript regexes; and we demonstrate that the cost of updating the mechanization to account for changes in the original specification is reasonably low. Our mechanization can be extracted to OCaml and linked with Unicode libraries to produce an executable engine that passes the relevant parts of the official Test262 conformance test suite.