Regular expression (RE) matching is a widely used operation that scans a text for occurrences of the patterns specified by an RE; it includes the simpler task of RE recognition. Here we address RE parsing, which subsumes matching: it returns not only the positions of the patterns in the text, but also the syntactic structure of each pattern occurrence, in the form of a tree that shows how the RE operators produced it. RE parsing makes matching more selective while avoiding the complications of context-free grammar parsers. Our parser handles ambiguous REs and texts by returning the set of all syntax trees, compressed into a Shared Packed Parse Forest data structure. We first convert the RE into a serial parser, which simulates a finite automaton (FA) in such a way that the states the automaton passes through encode the syntax tree of the input. On long texts, serial matching and parsing may be too slow for time-constrained applications. We therefore present a novel, efficient parallel parser for multi-processor computing platforms; its speed-up over the serial algorithm scales well with the text length. We apply to RE parsing the approach typical of parallel RE matchers and recognizers, in which the text is split into chunks that are parsed in parallel and whose results are then joined. This approach suffers from the so-called speculation overhead: a chunk processor does not know the state reached at the end of the preceding chunk, so it must speculatively start from all its states. We introduce a novel technique that minimizes the speculation overhead. The multi-threaded parser program, written in Java, has been validated, and its performance has been measured on a commodity multi-core computer using public and synthetic RE benchmarks. We report the speed-up over serial parsing, the parsing times, and the parser construction times.
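As a rough illustration of the chunk-and-join approach and of where the speculation overhead comes from, the following Java sketch performs chunked recognition only. It is not the parser described here and does not include the overhead-minimizing technique. The class name, the method names, and the three-state automaton for (ab)* are assumptions made up for the example. Each chunk is scanned once per possible start state, which is exactly the redundant work that speculation causes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SpeculativeChunkRecognizer {
    // Hypothetical DFA for (ab)*: 0 = start and accepting, 1 = after 'a', 2 = dead.
    static final int STATES = 3;
    static final int START = 0;
    static final int DEAD = 2;

    static int step(int state, char c) {
        if (state == 0 && c == 'a') return 1;
        if (state == 1 && c == 'b') return 0;
        return DEAD;
    }

    // Serial reference: one pass over the text from the known start state.
    static boolean serialAccepts(String text) {
        int s = START;
        for (int i = 0; i < text.length(); i++) s = step(s, text.charAt(i));
        return s == START;
    }

    // Speculative scan of text[from, to): the entry state is unknown, so the
    // chunk is run from every state, giving a map entry state -> exit state.
    static int[] chunkMap(String text, int from, int to) {
        int[] map = new int[STATES];
        for (int entry = 0; entry < STATES; entry++) {
            int s = entry;
            for (int i = from; i < to; i++) s = step(s, text.charAt(i));
            map[entry] = s;
        }
        return map;
    }

    // Split the text into `chunks` pieces (chunks >= 1), scan them in
    // parallel, then join the per-chunk maps from left to right.
    static boolean parallelAccepts(String text, int chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        try {
            List<Future<int[]>> parts = new ArrayList<>();
            int n = text.length();
            for (int k = 0; k < chunks; k++) {
                final int from = (int) ((long) n * k / chunks);
                final int to = (int) ((long) n * (k + 1) / chunks);
                Callable<int[]> task = () -> chunkMap(text, from, to);
                parts.add(pool.submit(task));
            }
            int s = START; // only the first chunk has a known entry state
            for (Future<int[]> part : parts) s = part.get()[s];
            return s == START;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String text = "abababab";
        System.out.println(serialAccepts(text) == parallelAccepts(text, 3));
    }
}
```

In this sketch every chunk costs one scan per automaton state, so the parallel version does up to `STATES` times the serial work before the join. Reducing that factor is the kind of saving a speculation-minimizing technique aims for.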