We study the problem of enumerating the results of a query over a compressed document. The compression model we use is that of straight-line programs (SLPs), which are context-free grammars that each produce a single string. For our queries, we use a model called Annotated Automata, an extension of regular automata that allows annotations on letters. This model extends the notion of Regular Spanners, as it allows arbitrarily long outputs. Our main result is an algorithm that evaluates such a query by enumerating all results with output-linear delay after a preprocessing phase that takes time linear in the size of the SLP and cubic in the size of the automaton. This improves on Schmid and Schweikardt's result, which, with the same preprocessing time, enumerates with a delay that is logarithmic in the size of the uncompressed document. We achieve this through a persistent data structure named Enumerable Compact Sets with Shifts, which guarantees output-linear delay under certain restrictions. These results imply constant-delay enumeration algorithms in the context of regular spanners. Further, we use an extension of annotated automata that uses succinctly encoded annotations to save an exponential factor over previous results on constant-delay enumeration over vset automata. Lastly, we extend our results in the same fashion as Schmid and Schweikardt did to allow complex document editing while maintaining the constant-delay guarantee.
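To make the compression model concrete, the following is a minimal sketch of a straight-line program, assuming a toy grammar in which every nonterminal has exactly one rule that is either a single terminal or a pair of nonterminals. The grammar `RULES` and the helpers `expand`, `lengths`, and `char_at` are hypothetical names chosen for illustration; they are not taken from the paper and do not implement its enumeration algorithm. The sketch only shows why an SLP can be processed in time proportional to the grammar rather than to the document it derives.

```python
# Hypothetical toy SLP: each nonterminal has exactly one rule, which is
# either a single terminal character or a pair of nonterminals.
RULES = {
    "A": "a",
    "B": "b",
    "C": ("A", "B"),  # C -> A B, derives "ab"
    "D": ("C", "C"),  # D -> C C, derives "abab"
    "S": ("D", "C"),  # S -> D C, derives "ababab"
}


def expand(symbol, rules):
    """Fully decompress the string derived by `symbol` (may be exponentially long)."""
    rule = rules[symbol]
    if isinstance(rule, str):
        return rule
    left, right = rule
    return expand(left, rules) + expand(right, rules)


def lengths(rules):
    """Length of the string derived by each nonterminal, computing each rule once."""
    memo = {}

    def length(symbol):
        if symbol not in memo:
            rule = rules[symbol]
            if isinstance(rule, str):
                memo[symbol] = 1
            else:
                memo[symbol] = length(rule[0]) + length(rule[1])
        return memo[symbol]

    for symbol in rules:
        length(symbol)
    return memo


def char_at(symbol, i, rules, lens):
    """Return position i of the derived string by descending the grammar, without decompressing."""
    rule = rules[symbol]
    while not isinstance(rule, str):
        left, right = rule
        if i < lens[left]:
            symbol = left
        else:
            i -= lens[left]
            symbol = right
        rule = rules[symbol]
    return rule
```

In this sketch, `lengths` touches every rule once, so its cost depends on the number of rules and not on the length of the derived document, while `char_at` walks a single root-to-leaf path in the derivation. Only `expand` pays for the full uncompressed size.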