Revisiting Weighted Information Extraction: A Simpler and Faster Algorithm for Ranked Enumeration

Information extraction from textual data, where the query is represented by a finite transducer and the task is to enumerate all results without repetition, is an important and well-studied problem in database theory. So is its extension to the weighted case, where each output element has a weight and the output elements must be enumerated sorted by their weights. The unweighted framework already covers the well-known case of regular document spanners, while the weighted setting covers several practically relevant tasks that cannot be described in the unweighted setting. It is known that the unweighted problem can be solved with linear-time preprocessing $O(|\mathsf{D}|)$ and output-linear delay $O(|s|)$ in data complexity, where $\mathsf{D}$ is the input data and $s$ is the current output element. For the weighted case, Bourhis, Grez, Jachiet, and Riveros [ICDT 2021] recently designed an algorithm with linear-time preprocessing, but its delay of $O(|s| \cdot \log|\mathsf{D}|)$ depends on the size of the data. We first show how to leverage existing results on enumerating shortest paths to obtain a simple alternative algorithm with linear-time preprocessing and a delay of $O(|s_i| + \min\{ \log i, \log|\mathsf{D}|\})$ for the $i^{\text{th}}$ output element $s_i$ (in data complexity), which substantially improves on the previous algorithm. Next, we develop a technically involved rounding technique that yields an algorithm with linear-time preprocessing and output-linear delay $O(|s|)$ with high probability. To this end, we combine tools from algebra, high-dimensional geometry, and linear programming.
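
To make the notion of ranked enumeration concrete, the following is a minimal sketch of enumerating paths of a weighted DAG in nondecreasing order of weight. It is a generic best-first search with exact distance-to-target bounds, not the algorithm of this paper, and it does not achieve the delay bounds stated above. The function name `ranked_paths`, the adjacency-list encoding, and the example graph are assumptions made purely for illustration.

```python
import heapq
from itertools import count


def ranked_paths(graph, source, target):
    """Yield (weight, path) for every source-to-target path of a weighted DAG,
    in nondecreasing order of weight.

    `graph` maps a node to a list of (successor, edge_weight) pairs
    (hypothetical encoding chosen for this sketch).
    """
    inf = float("inf")
    dist = {}  # dist[v] = weight of a lightest v-to-target path

    def lightest(v):
        # Memoised recursion; terminates because the graph is acyclic.
        if v == target:
            return 0
        if v not in dist:
            dist[v] = min(
                (w + lightest(u) for u, w in graph.get(v, [])), default=inf
            )
        return dist[v]

    if lightest(source) == inf:
        return  # target unreachable: nothing to enumerate

    tie = count()  # tie-breaker so the heap never compares paths
    # Heap entries: (weight of best completion, tie, weight so far, path)
    heap = [(lightest(source), next(tie), 0, (source,))]
    while heap:
        _, _, weight, path = heapq.heappop(heap)
        last = path[-1]
        if last == target:
            yield weight, path
            continue
        for nxt, w in graph.get(last, []):
            bound = weight + w + lightest(nxt)
            if bound < inf:  # skip prefixes that cannot reach the target
                heapq.heappush(heap, (bound, next(tie), weight + w, path + (nxt,)))


# Illustrative graph (assumed): three s-to-t paths of weights 3, 5, and 6.
example = {"s": [("a", 1), ("b", 4)], "a": [("b", 1), ("t", 5)], "b": [("t", 1)]}
for weight, path in ranked_paths(example, "s", "t"):
    print(weight, "->".join(path))
```

Each heap key is the weight of the lightest completion of a partial path, so complete paths leave the heap in sorted order. Copying the path tuple on every extension is a simplification that a delay-sensitive implementation would avoid.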
