flap: A Deterministic Parser with Fused Lexing

Lexers and parsers are typically defined separately and connected by a token stream. This separate definition is important for modularity and reduces the potential for parsing ambiguity. However, materializing tokens as data structures and case-switching on tokens comes at a cost. We show how to fuse separately defined lexers and parsers, drastically improving performance without compromising modularity or increasing ambiguity. We propose a deterministic variant of Greibach Normal Form that ensures deterministic parsing with a single token of lookahead and makes fusion strikingly simple, and we prove that normalizing context-free expressions into the deterministic normal form is semantics-preserving. Our staged parser combinator library, flap, provides a standard interface but generates specialized token-free code that runs two to six times faster than ocamlyacc on a range of benchmarks.
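To make the idea of fusion concrete, the following is a minimal hand-written sketch in Python. It is not flap's interface or its generated code: flap is an OCaml library that derives fused code by staging, whereas here the fusion is done by hand, and the tiny S-expression grammar and every name (`lex`, `parse_unfused`, `parse_fused`) are assumptions made for illustration. Each alternative of the grammar starts with a distinct token, so one token of lookahead picks the branch, loosely in the spirit of the deterministic normal form. The unfused parser materializes tokens and switches on their kind; the fused parser branches directly on characters and never builds a token.

```python
import re
from typing import Iterator, List, Tuple

# Illustrative grammar (an assumption, not from the paper):
#   sexp ::= ATOM | "(" sexp* ")"      where ATOM = [a-z]+
# Both parsers return the number of atoms in one S-expression.

Token = Tuple[str, str]  # (kind, text)
TOKEN_RE = re.compile(r"\s*(?:(?P<LPAREN>\()|(?P<RPAREN>\))|(?P<ATOM>[a-z]+))")


def lex(src: str) -> Iterator[Token]:
    """Separate lexer: materializes one tuple per token."""
    src = src.rstrip()
    pos = 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"bad character at offset {pos}")
        yield (m.lastgroup, m.group(m.lastgroup))
        pos = m.end()


def parse_unfused(src: str) -> int:
    """Parser over a token list: branches on the kind of the next token."""
    tokens: List[Token] = list(lex(src))

    def sexp(i: int) -> Tuple[int, int]:
        if i >= len(tokens):
            raise SyntaxError("unexpected end of input")
        kind, _ = tokens[i]
        if kind == "ATOM":
            return 1, i + 1
        if kind == "LPAREN":
            total, i = 0, i + 1
            while i < len(tokens) and tokens[i][0] != "RPAREN":
                n, i = sexp(i)
                total += n
            if i >= len(tokens):
                raise SyntaxError("missing ')'")
            return total, i + 1
        raise SyntaxError("unexpected ')'")

    count, end = sexp(0)
    if end != len(tokens):
        raise SyntaxError("trailing input")
    return count


def parse_fused(src: str) -> int:
    """Hand-fused version: same grammar, but branches on characters directly."""
    n = len(src)

    def skip_ws(i: int) -> int:
        while i < n and src[i].isspace():
            i += 1
        return i

    def sexp(i: int) -> Tuple[int, int]:
        i = skip_ws(i)
        if i >= n:
            raise SyntaxError("unexpected end of input")
        c = src[i]
        if "a" <= c <= "z":          # ATOM branch: consume the whole atom in place
            i += 1
            while i < n and "a" <= src[i] <= "z":
                i += 1
            return 1, i
        if c == "(":                 # LPAREN branch
            total, i = 0, skip_ws(i + 1)
            while i < n and src[i] != ")":
                k, i = sexp(i)
                total += k
                i = skip_ws(i)
            if i >= n:
                raise SyntaxError("missing ')'")
            return total, i + 1
        raise SyntaxError(f"unexpected character at offset {i}")

    count, end = sexp(0)
    if skip_ws(end) != n:
        raise SyntaxError("trailing input")
    return count
```

Both functions accept the same inputs and return the same counts; the difference is only that `parse_fused` skips the intermediate token tuples and the dispatch on token kinds, which is the overhead the abstract attributes to the conventional design.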

