When extracting a relation of spans (intervals) from a text document, a common practice is to filter out tuples of the relation that are deemed dominated by others. The domination rule is defined as a partial order that varies across systems and tasks. For example, we may state that a tuple is dominated by tuples that extend it by assigning additional attributes or by assigning larger intervals. The result of filtering the relation is then the skyline according to this partial order. As this filtering may remove most of the extracted tuples, we study whether we can improve the performance of the extraction by compiling the domination rule into the extractor. To this end, we introduce the skyline operator for declarative information extraction tasks expressed as document spanners. We show that this operator can be expressed via regular operations when the domination partial order can itself be expressed as a regular spanner, which covers several natural domination rules. Yet, we show that the skyline operator incurs a computational cost (under combined complexity). First, there are cases where the operator requires an exponential blowup in the number of states needed to represent the spanner as a sequential variable-set automaton. Second, the evaluation may become computationally hard. Our analysis identifies more precisely the classes of domination rules for which the combined complexity is tractable or intractable.
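To make the notion of a skyline concrete, the following is a minimal Python sketch of the naive post-filtering baseline described above, not of the compiled construction studied in the paper. It assumes one illustrative domination rule that combines the two examples mentioned: a tuple is dominated by a different tuple that assigns every one of its variables a span containing the original span, possibly while assigning additional variables. The names `Span`, `Mapping`, `contains`, `dominated_by`, and `skyline` are hypothetical and chosen only for this illustration.

```python
from typing import Dict, List, Tuple

# Assumed representation: a span is a half-open interval [start, end),
# and an extracted tuple maps variable names to spans.
# A variable that is absent from the dict is considered unassigned.
Span = Tuple[int, int]
Mapping = Dict[str, Span]


def contains(outer: Span, inner: Span) -> bool:
    """Return True if `outer` covers `inner` (equality counts as containment)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def dominated_by(t: Mapping, u: Mapping) -> bool:
    """Return True if `u` strictly dominates `t` under the assumed rule:
    `u` differs from `t`, assigns every variable that `t` assigns,
    and gives each of them a span containing the one in `t`."""
    if t == u:
        return False
    return all(x in u and contains(u[x], s) for x, s in t.items())


def skyline(tuples: List[Mapping]) -> List[Mapping]:
    """Keep the tuples that no other tuple dominates (quadratic post-filter)."""
    return [t for t in tuples if not any(dominated_by(t, u) for u in tuples)]


extracted = [
    {"x": (0, 3)},
    {"x": (0, 5)},
    {"x": (0, 5), "y": (6, 8)},
    {"x": (7, 9)},
]
result = skyline(extracted)
```

In this example the first tuple is dominated by the second (a larger interval for `x`), and the second by the third (an additional variable `y`), so only the last two tuples remain. The filter compares every pair of tuples after extraction has finished, which is the cost that compiling the domination rule into the extractor aims to avoid.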