When extracting a relation of spans (intervals) from a text document, a common practice is to filter out tuples of the relation that are deemed dominated by others. The domination rule is defined as a partial order that varies across systems and tasks. For example, we may state that a tuple is dominated by tuples that extend it, either by assigning additional attributes or by assigning larger intervals. The result of filtering the relation would then be the skyline according to this partial order. As this filtering may remove most of the extracted tuples, we study whether we can improve the performance of the extraction by compiling the domination rule into the extractor. To this end, we introduce the skyline operator for declarative information extraction tasks expressed as document spanners. We show that this operator can be expressed via regular operations when the domination partial order can itself be expressed as a regular spanner, which covers several natural domination rules. Yet, we show that the skyline operator incurs a computational cost (under combined complexity). First, there are cases where the operator requires an exponential blowup in the number of states needed to represent the spanner as a sequential variable-set automaton. Second, the evaluation may become computationally hard. More precisely, our analysis identifies classes of domination rules for which the combined complexity is tractable or intractable.
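To make the notion of skyline concrete, the following is a minimal sketch of the naive post-hoc filter that the abstract takes as its starting point, not the compiled construction studied in the paper. It assumes one particular domination rule combining the two examples above: a tuple is dominated by another tuple that assigns at least the same variables and gives each of them an equal or larger span. The names `Span`, `Mapping`, `contains`, `dominated_by`, and `skyline` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]      # assumed encoding: half-open interval [start, end)
Mapping = Dict[str, Span]   # variable name -> span; a variable may be unassigned


def contains(outer: Span, inner: Span) -> bool:
    """True if `inner` lies within `outer` (equal spans count as contained)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def dominated_by(t: Mapping, u: Mapping) -> bool:
    """True if `u` strictly dominates `t` under the assumed rule:
    `u` assigns every variable of `t`, each to a span containing t's span."""
    if t == u:
        return False
    return all(x in u and contains(u[x], s) for x, s in t.items())


def skyline(relation: List[Mapping]) -> List[Mapping]:
    """Keep the tuples that no other tuple of the relation dominates.
    This compares all pairs, so it is quadratic in the size of the relation."""
    return [t for t in relation if not any(dominated_by(t, u) for u in relation)]


relation = [
    {"x": (0, 3)},               # dominated by the next tuple (larger span for x)
    {"x": (0, 5)},
    {"x": (0, 3), "y": (4, 6)},
    {"x": (2, 5), "y": (4, 6)},
]
result = skyline(relation)
```

Here only the first tuple is filtered out: the tuples that assign `y` cannot be dominated by `{"x": (0, 5)}`, and their spans for `x` are incomparable with each other. The filter runs only after the whole relation has been materialized, which is the cost that compiling the domination rule into the extractor aims to avoid.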