We present a theoretical framework for extracting and transforming text documents. We propose a two-phase process: the first phase extracts span-tuples from a document, and the second phase maps the content of the span-tuples into new documents. We base the extraction phase on the framework of document spanners and the transformation phase on the theory of polyregular functions, the class of regular string-to-string functions with polynomial growth. To support practical extract-transform scenarios, we propose an extension of document spanners described by regex formulas from span-tuples to so-called multispan-tuples, where variables are mapped to sets of spans. We prove that this extension, called regex multispanners, has the same desirable properties as standard spanners described by regex formulas. In our framework, an Extract-Transform (ET) program is given by a regex multispanner followed by a polyregular function. In this paper, we study the expressiveness and the evaluation problem of ET programs whose transformation function is linear, which we call linear ET programs. We show that linear ET programs are as expressive as non-deterministic streaming string transducers under bag semantics. Moreover, we show that linear ET programs are closed under composition. Finally, we present an enumeration algorithm that evaluates every linear ET program over a document with linear-time preprocessing and constant delay.
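To make the two-phase idea concrete, the following is a minimal illustrative sketch in Python, not the formalism or the algorithm of the paper. It assumes a toy line-oriented document, approximates a multispan-tuple as a dictionary from variable names to lists of spans, and uses Python's `re` module in place of regex formulas. All names (`extract`, `transform`, `run_et`, `LINE`, `MAIL`) are hypothetical. The transformation reads each extracted span once, which is only meant to loosely evoke the linear case, and the results are kept in a list so that duplicate outputs are not merged, in the spirit of bag semantics.

```python
import re
from typing import Dict, Iterator, List, Tuple

Span = Tuple[int, int]                   # half-open interval [start, end) into the document
MultispanTuple = Dict[str, List[Span]]   # variable name -> spans assigned to it

# Hypothetical toy patterns: one "name: mail, mail, ..." record per line.
LINE = re.compile(r"(?P<name>\w+):(?P<rest>[^\n]*)")
MAIL = re.compile(r"[\w.]+@[\w.]+")


def extract(doc: str) -> Iterator[MultispanTuple]:
    """Extraction phase: yield one multispan-tuple per record in the document."""
    for line in LINE.finditer(doc):
        rest_start, rest_end = line.span("rest")
        # The variable "mail" may receive several spans, "name" exactly one.
        mails = [m.span() for m in MAIL.finditer(doc, rest_start, rest_end)]
        yield {"name": [line.span("name")], "mail": mails}


def transform(doc: str, t: MultispanTuple) -> str:
    """Transformation phase: map the content of one tuple to a new document.

    Each span is read exactly once, so the output length is linear in the
    total length of the extracted content.
    """
    (name_start, name_end), = t["name"]
    mails = "; ".join(doc[s:e] for s, e in t["mail"])
    return f"{doc[name_start:name_end]}: {mails}"


def run_et(doc: str) -> List[str]:
    """Compose both phases; a list keeps duplicate outputs (bag-like)."""
    return [transform(doc, t) for t in extract(doc)]
```

For example, on the document `"Alice: a@x.org, a@y.org\nBob: b@z.org\n"` the sketch produces two output documents, one per record. It materializes every output eagerly and makes no attempt at the linear-time preprocessing and constant-delay enumeration discussed in the abstract.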