Information extraction from textual data, where the query is represented by a finite transducer and the task is to enumerate all results without repetition, is an important and well-studied problem in database theory. So is its extension to the weighted case, where each output element has a weight and the output elements must be enumerated in order of their weights. The unweighted framework already covers the well-known case of regular document spanners, while the weighted setting covers several practically relevant tasks that cannot be described in the unweighted setting. In the unweighted case, the problem is known to be solvable with linear-time preprocessing O(|D|) and output-linear delay O(|s|) in data complexity, where D is the input data and s is the current output element. For the weighted case, Bourhis, Grez, Jachiet, and Riveros [ICDT 2021] recently designed an algorithm with linear-time preprocessing, but its delay of O(|s| log(|D|)) depends on the size of the data. We first show how to leverage existing results on enumerating shortest paths to obtain a simple alternative algorithm with linear preprocessing and a delay of O(|s_i| + min{log(i), log(|D|)}) for the i-th output element s_i (in data complexity), which substantially improves on the previous algorithm. Next, we develop a technically involved rounding technique that allows us to devise an algorithm with linear-time preprocessing and output-linear delay O(|s|) with high probability. To this end, we combine tools from algebra, high-dimensional geometry, and linear programming.
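To make the weighted enumeration task concrete, the following is a minimal sketch of ranked path enumeration in a small weighted DAG. It is not the algorithm of this paper and makes no claim to match any of the delay bounds stated above. It is a plain best-first search over partial paths, guided by the exact cheapest completion weight of each node. The function name `ranked_paths`, the edge format `(from, to, label, weight)`, the toy graph, and the choice of nondecreasing weight order are all assumptions made for illustration.

```python
import heapq
from itertools import count


def ranked_paths(edges, source, sink):
    """Yield (weight, labels) for every source-to-sink path of a weighted DAG,
    in nondecreasing order of weight and without repetition.

    Illustrative sketch only: `edges` is a list of hypothetical
    (from_node, to_node, label, weight) tuples and the graph must be acyclic.
    """
    # Adjacency list: node -> list of (next_node, label, weight).
    adj = {}
    for u, v, label, w in edges:
        adj.setdefault(u, []).append((v, label, w))
        adj.setdefault(v, [])
    adj.setdefault(source, [])
    adj.setdefault(sink, [])

    # best[v] = weight of the cheapest path from v to the sink (inf if none).
    # Memoized recursion terminates because the graph is acyclic.
    best = {}

    def cheapest(v):
        if v not in best:
            if v == sink:
                best[v] = 0
            else:
                best[v] = min(
                    (w + cheapest(x) for x, _, w in adj[v]),
                    default=float("inf"),
                )
        return best[v]

    if cheapest(source) == float("inf"):
        return  # the sink is unreachable, so there is nothing to enumerate

    tie = count()  # unique tiebreaker so the heap never compares nodes or labels
    # Heap entry: (weight of cheapest completion, tiebreak, node, weight so far, labels)
    heap = [(cheapest(source), next(tie), source, 0, ())]
    while heap:
        _, _, node, so_far, labels = heapq.heappop(heap)
        if node == sink:
            yield so_far, labels
            continue
        for nxt, label, w in adj[node]:
            if cheapest(nxt) != float("inf"):
                new_weight = so_far + w
                heapq.heappush(
                    heap,
                    (new_weight + cheapest(nxt), next(tie), nxt, new_weight, labels + (label,)),
                )


# Toy input: three distinct paths from "s" to "t" with weights 3, 5 and 6.
edges = [
    ("s", "a", "x", 1),
    ("s", "b", "y", 2),
    ("a", "t", "z", 5),
    ("b", "t", "z", 1),
    ("a", "b", "w", 3),
]
results = list(ranked_paths(edges, "s", "t"))
```

On this toy graph, `results` lists the path labelled `y z` (weight 3) first, then `x w z` (weight 5), then `x z` (weight 6). Every heap operation in the sketch costs time logarithmic in the heap size, so the delay of this naive approach still depends on the size of the input.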