We present an algorithm for regular expression parsing and submatch extraction based on tagged deterministic finite automata. The algorithm works with different disambiguation policies. We give detailed pseudocode for the algorithm, covering important practical optimizations. All transformations from a regular expression to an optimized automaton are explained on a step-by-step example. We consider both ahead-of-time and just-in-time determinization and describe variants of the algorithm suited to each setting. We provide benchmarks showing that the algorithm is very fast in practice. Our research is based on two independent implementations: an open-source lexer generator RE2C and an experimental Java library.
翻译:本文提出一种基于标记化确定性有限自动机的正则表达式解析与子匹配提取算法。该算法适用于多种消歧策略。我们提供了详细的算法伪代码,涵盖重要的实际优化技术。通过逐步示例演示了从正则表达式到优化自动机的所有转换过程。我们同时考察了预编译与即时确定性转换两种模式,并描述了适用于不同场景的算法变体。基准测试表明该算法在实际应用中具有极高的执行效率。本研究基于两个独立实现:开源词法分析器生成器RE2C与实验性Java库。