Lexers and parsers are typically defined separately and connected by a token stream. This separate definition is important for modularity and reduces the potential for parsing ambiguity. However, materializing tokens as data structures and case-switching on tokens come with a cost. We show how to fuse separately defined lexers and parsers, drastically improving performance without compromising modularity or increasing ambiguity. We propose a deterministic variant of Greibach Normal Form that ensures deterministic parsing with a single token of lookahead and makes fusion strikingly simple, and we prove that normalizing context-free expressions into the deterministic normal form is semantics-preserving. Our staged parser combinator library, flap, provides a standard interface, but generates specialized token-free code that runs two to six times faster than ocamlyacc on a range of benchmarks.