We present ZipLex, a verified framework for invertible linear-time lexical analysis following the longest match (maximal munch) semantics. Unlike past verified lexers that focus only on satisfying the semantics of regular expressions and the longest match property, ZipLex also guarantees that lexing and printing are mutual inverses. Thanks to verified memoization, it also ensures that the lexical analysis of a string is linear in the size of the string. Our design and implementation rely on two sets of ideas: (1) a new abstraction of token sequences that captures the separability of tokens in a sequence while supporting their efficient manipulation, and (2) a combination of verified data structures and optimizations, including Huet's zippers and memoization with a standalone verified imperative hash table. Our hash table offers competitive performance as shown by our evaluation. We implemented and verified ZipLex using the Stainless deductive verifier for Scala. Our evaluation demonstrates that ZipLex supports realistic applications such as JSON processing and lexers of programming languages, and behaves linearly even in cases that make flex-style approaches quadratic. ZipLex is two orders of magnitude faster than Verbatim++, showing that verified invertibility and linear-time algorithms can be developed without prohibitive cost. Compared to Coqlex, ZipLex also offers linear (instead of quadratic) time lexing, and is the first lexer that comes with invertibility proofs for printing token sequences.
翻译:我们提出了ZipLex,一个遵循最长匹配(最大咀嚼)语义的可逆线性时间词法分析验证框架。与过去仅关注满足正则表达式语义和最长匹配属性的已验证词法分析器不同,ZipLex还保证词法分析和打印互为逆操作。得益于已验证的记忆化技术,它确保字符串的词法分析时间与字符串大小呈线性关系。我们的设计与实现依赖两组核心思想:(1)一种新的词元序列抽象,能够捕获序列中词元的可分性同时支持高效操作;(2)已验证数据结构与优化的组合,包括Huet的拉链技术和带独立已验证命令式哈希表的记忆化机制。评估表明我们的哈希表具有竞争性性能。我们使用Scala的Stainless演绎验证器实现并验证了ZipLex。评估显示ZipLex支持JSON处理与编程语言词法分析器等实际应用,即使在使flex风格方法呈现二次复杂度的场景中仍保持线性行为。ZipLex的速度比Verbatim++快两个数量级,证明验证的可逆性与线性时间算法可以在可接受成本下开发。与Coqlex相比,ZipLex提供线性(而非二次)时间词法分析,并且是首个附带打印词元序列可逆性证明的词法分析器。