Coqlex: Generating Formally Verified Lexers

A compiler consists of a sequence of phases going from lexical analysis to code generation. Ideally, the formal verification of a compiler should include the formal verification of each component of the tool-chain. An example is the CompCert project, a formally verified C compiler, that comes with associated tools and proofs that allow to formally verify most of those components. However, some components, in particular the lexer, remain unverified. In fact, the lexer of Compcert is generated using OCamllex, a lex-like OCaml lexer generator that produces lexers from a set of regular expressions with associated semantic actions. Even though there exist various approaches, like CakeML or Verbatim++, to write verified lexers, they all have only limited practical applicability. In order to contribute to the end-to-end verification of compilers, we implemented a generator of verified lexers whose usage is similar to OCamllex. Our software, called Coqlex, reads a lexer specification and generates a lexer equipped with a Coq proof of its correctness. It provides a formally verified implementation of most features of standard, unverified lexer generators. The conclusions of our work are two-fold: Firstly, verified lexers gain to follow a user experience similar to lex/flex or OCamllex, with a domain-specific syntax to write lexers comfortably. This introduces a small gap between the written artifact and the verified lexer, but our design minimizes this gap and makes it practical to review the generated lexer. The user remains able to prove further properties of their lexer. Secondly, it is possible to combine simplicity and decent performance. Our implementation approach that uses Brzozowski derivatives is noticeably simpler than the previous work in Verbatim++ that tries to generate a deterministic finite automaton (DFA) ahead of time, and it is also noticeably faster thanks to careful design choices. We wrote several example lexers that suggest that the convenience of using Coqlex is close to that of standard verified generators, in particular, OCamllex. We used Coqlex in an industrial project to implement a verified lexer of Ada. This lexer is part of a tool to optimize safety-critical programs, some of which are very large. This experience confirmed that Coqlex is usable in practice, and in particular that its performance is good enough. Finally, we performed detailed performance comparisons between Coqlex, OCamllex, and Verbatim++. Verbatim++ is the state-of-the-art tool for verified lexers in Coq, and the performance of its lexer was carefully optimized in previous work by Egolf and al. (2022). Our results suggest that Coqlex is two orders of magnitude slower than OCamllex, but two orders of magnitude faster than Verbatim++. Verified compilers and other language-processing tools are becoming important tools for safety-critical or security-critical applications. They provide trust and replace more costly approaches to certification, such as manually reading the generated code. Verified lexers are a missing piece in several Coq-based verified compilers today. Coqlex comes with safety guarantees, and thus shows that it is possible to build formally verified front-ends.

翻译：编译器由从词法分析到代码生成的多个阶段组成。理想情况下，编译器的形式化验证应包含工具链中每个组件的形式化验证。例如CompCert项目——一个经过形式化验证的C语言编译器，它提供了相关工具与证明，使得大部分组件能够被形式化验证。然而，某些组件（尤其是词法分析器）仍未得到验证。事实上，CompCert的词法分析器是通过OCamllex生成的，这是一种类lex的OCaml词法分析器生成器，能从一组带有相关语义动作的正则表达式生成词法分析器。尽管存在诸如CakeML或Verbatim++等方法用于编写经过验证的词法分析器，但其实际适用性均十分有限。为助力编译器的端到端验证，我们实现了一个已验证词法分析器的生成器，其用法与OCamllex类似。我们的软件名为Coqlex，它读取词法分析器规范，并生成一个附带Coq正确性证明的词法分析器。该工具提供了对标准非验证词法分析器生成器大多数功能的形式化验证实现。我们的工作结论有二：其一，经过验证的词法分析器应遵循类似lex/flex或OCamllex的用户体验，采用领域特定语法以方便编写词法分析器。这虽会在编写产物与验证后的词法分析器之间引入微小差距，但我们的设计将该差距最小化，使审查生成的词法分析器成为可能，且用户仍能证明词法分析器的其他性质。其二，实现简洁性与良好性能可兼得。我们采用Brzozowski导数的实现方法显著简化了此前Verbatim++中预先生成确定性有限自动机（DFA）的方案，同时通过精心设计实现了显著的速度提升。我们编写了多个示例词法分析器，表明使用Coqlex的便捷性与标准验证生成器（尤其是OCamllex）相近。在一个工业项目中，我们利用Coqlex实现了Ada语言验证词法分析器，该分析器是优化安全关键（部分为超大规模）程序的工具组件。实践证实Coqlex具备实际可用性，特别是其性能表现足够优秀。最后，我们对Coqlex、OCamllex和Verbatim++进行了详细的性能比较。Verbatim++是Coq中验证词法分析器的前沿工具，其词法分析器性能经Egolf等人（2022）精心优化。结果表明：Coqlex比OCamllex慢两个数量级，但比Verbatim++快两个数量级。验证编译器及其他语言处理工具正成为安全关键或安全攸关应用的重要工具。它们提供信任保障，并替代了人工审阅生成代码等成本更高的认证方法。验证词法分析器目前仍是多个基于Coq的验证编译器中的缺失环节。Coqlex提供安全保障，证明了构建形式化验证前端是可行的。