Lexers and parsers are typically defined separately and connected by a token stream. This separation is important for modularity and reduces the potential for parsing ambiguity. However, materializing tokens as data structures and case-switching on them come at a cost. We show how to fuse separately defined lexers and parsers, drastically improving performance without compromising modularity or increasing ambiguity. We propose a deterministic variant of Greibach Normal Form that ensures deterministic parsing with a single token of lookahead and makes fusion strikingly simple, and we prove that normalizing context-free expressions into the deterministic normal form is semantics-preserving. Our staged parser combinator library, flap, provides a standard interface but generates specialized token-free code that runs two to six times faster than ocamlyacc on a range of benchmarks.
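To make the cost of a materialized token stream concrete, the following is a minimal hand-written sketch in Python. It is not flap's API and not the code flap generates. The toy grammar (`sum ::= INT ('+' INT)*`) and the names `lex`, `parse_tokens`, and `parse_fused` are assumptions chosen for illustration. The first parser consumes tuples produced by a separate lexer and switches on their kinds; the second recognizes the same language by branching directly on input characters, so no token values are ever built.

```python
from typing import Iterator, Tuple

# Toy grammar (hypothetical, for illustration): sum ::= INT ('+' INT)*
# Whitespace (space, tab, newline) may appear between tokens.

WS = " \t\n"
Token = Tuple[str, int]  # (kind, value); value is unused for PLUS


def is_digit(c: str) -> bool:
    return "0" <= c <= "9"


def lex(s: str) -> Iterator[Token]:
    """Separate lexer: materializes one tuple per token."""
    i, n = 0, len(s)
    while i < n:
        c = s[i]
        if c in WS:
            i += 1
        elif is_digit(c):
            j = i
            while j < n and is_digit(s[j]):
                j += 1
            yield ("INT", int(s[i:j]))
            i = j
        elif c == "+":
            yield ("PLUS", 0)
            i += 1
        else:
            raise SyntaxError(f"unexpected character {c!r} at offset {i}")


def parse_tokens(s: str) -> int:
    """Separate parser: switches on token kinds taken from the stream."""
    tokens = lex(s)
    kind, value = next(tokens, ("EOF", 0))
    if kind != "INT":
        raise SyntaxError("expected integer")
    total = value
    for kind, _ in tokens:
        if kind != "PLUS":
            raise SyntaxError("expected '+'")
        kind, value = next(tokens, ("EOF", 0))
        if kind != "INT":
            raise SyntaxError("expected integer after '+'")
        total += value
    return total


def parse_fused(s: str) -> int:
    """Fused version: branches on characters; no token values are built."""
    n = len(s)

    def skip_ws(i: int) -> int:
        while i < n and s[i] in WS:
            i += 1
        return i

    def integer(i: int) -> Tuple[int, int]:
        i = skip_ws(i)
        j = i
        while j < n and is_digit(s[j]):
            j += 1
        if j == i:
            raise SyntaxError(f"expected integer at offset {i}")
        return int(s[i:j]), j

    total, i = integer(0)
    while True:
        i = skip_ws(i)
        if i == n:
            return total
        if s[i] != "+":
            raise SyntaxError(f"expected '+' at offset {i}")
        value, i = integer(i + 1)
        total += value
```

Both functions accept the same inputs and compute the same sum. The fused version simply skips the intermediate `("INT", n)` and `("PLUS", 0)` tuples. In this sketch the fusion is done by hand for one grammar, whereas the abstract describes deriving such code automatically from separately defined lexers and parsers.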