Verified invertible lexer using regular expressions and DFAs

In this project, we explore the concept of invertibility applied to serialisation and lexing frameworks. Recall that, on one hand, serialisation is the process of taking a data structure and writing it to a bit array while parsing is the reverse operation, i.e., reading the bit array and constructing the data structure back. While lexing, on the other hand, is the process of reading a stream of characters and splitting them into tokens, by following a list of given rules. While used in different applications, both are similar in their abstract operation: they both take a list of simple characters and extract a more complex structure. Applications in which these two operations are used are different but they share a need for the invertibility of the process. For example, when tokenising a code file that was prettyprinted by a compiler, one would expect to get the same sequence of tokens. Similarly, when a spacecraft sends scientific data to the ground, one would expect the parsed data to be the same as the one serialised by the spacecraft. The idea of this project is to explore the idea of having a framework capable of generating parser/serialiser or lexer/prettyprinter pairs with a formally verified notion of invertibility. We first explore related works and frameworks. After that, we present our verified lexer framework developed in Scala and verified using the Stainless framework1. We explain the implementation choices we make and present the specifications and their proofs. The code of the lexer with the proofs is available on Github2. The main branch contains the regular expression (called regex from now on) matcher version and the verified Computable Languages while the dfa match branch contains the version using the DFA matcher.

翻译：本项目探讨可逆性在序列化与词法分析框架中的应用。一方面，序列化是将数据结构转换为比特数组的过程，而解析是其逆操作，即读取比特数组并重建数据结构。另一方面，词法分析是通过给定规则读取字符流并将其拆分为词法单元的过程。尽管应用场景不同，两者在抽象操作层面具有相似性：皆以简单字符序列为输入，提取更复杂的结构。虽然这两种操作的应用领域各异，但均对过程可逆性存在共同需求。例如，对经编译器格式化输出的代码文件进行词法分析时，应获得与原文件一致的词法单元序列；同理，当航天器向地面传输科学数据时，解析所得数据应与航天器序列化的原始数据完全相同。本项目旨在构建能够生成具备形式化验证可逆性的解析器/序列化器或词法分析器/格式化输出器对偶的框架。我们首先综述相关研究与现有框架，随后展示基于Scala开发并利用Stainless框架验证的可逆词法分析器框架。文中详细阐述了实现方案的选择依据，并给出了形式化规范及其证明过程。包含验证代码的词法分析器已在Github开源：主分支提供基于正则表达式（下文简称regex）匹配器的实现及已验证的可计算语言模块，dfa_match分支则包含采用DFA匹配器的版本。