Large-alphabet strings are common in scenarios such as information retrieval and natural-language processing. Storing and processing such strings efficiently raises several challenges that do not arise with small-alphabet strings. This paper studies the efficient implementation of one of the most effective approaches for dealing with large-alphabet strings, namely the \emph{alphabet-partitioning} approach. The main contribution is a compressed data structure that supports the fundamental operations $rank$ and $select$ efficiently. Our experimental results indicate that our implementation outperforms current realizations of the alphabet-partitioning approach. In particular, the time for operation $select$ can be improved by about 80%, using only 11% more space than current alphabet-partitioning schemes. We also show the impact of our data structure on several applications: the intersection of inverted lists (where improvements of up to 60% are achieved, using only 2% of extra space), the representation of run-length compressed strings, and the distributed-computation processing of $rank$ and $select$ operations. In the particular case of run-length compressed strings, our experiments on the Burrows-Wheeler transform of highly repetitive texts indicate that, using only about 0.98--1.09 times the space of state-of-the-art RLFM-indexes (depending on the text), counting the number of occurrences of a pattern in a text can be carried out 1.23--2.33 times faster.
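For readers unfamiliar with the operations, the following is a minimal sketch of what $rank$ and $select$ compute and of how alphabet partitioning reduces them to queries on smaller sequences. It is an illustration under stated assumptions, not the data structure proposed in the paper: plain Python lists stand in for compressed sequences, the names `rank`, `select`, and `PartitionedString` are hypothetical, and the rule that assigns the symbol with frequency rank $r$ to class $\lfloor \log_2 r \rfloor$ is one common choice that the abstract itself does not specify.

```python
from collections import Counter


def rank(seq, c, i):
    """Number of occurrences of c in the prefix seq[0:i]."""
    return sum(1 for x in seq[:i] if x == c)


def select(seq, c, j):
    """0-based position of the j-th (j >= 1) occurrence of c, or -1 if absent."""
    count = 0
    for pos, x in enumerate(seq):
        if x == c:
            count += 1
            if count == j:
                return pos
    return -1


class PartitionedString:
    """Toy alphabet partitioning: a class sequence K plus one subsequence per class.

    Assumption: symbols are sorted by decreasing frequency and the symbol with
    frequency rank r (1-based) is placed in class floor(log2(r)).
    """

    def __init__(self, seq):
        freq = Counter(seq)
        ordered = sorted(freq, key=lambda s: (-freq[s], s))
        self.cls = {}  # symbol -> (class id, local id inside the class)
        for r, s in enumerate(ordered, start=1):
            k = r.bit_length() - 1          # floor(log2(r))
            self.cls[s] = (k, r - (1 << k))
        # K[i] is the class of seq[i]; sub[k] lists local ids of class-k symbols in order.
        self.K = [self.cls[s][0] for s in seq]
        self.sub = {}
        for s in seq:
            k, local = self.cls[s]
            self.sub.setdefault(k, []).append(local)

    def rank(self, c, i):
        if c not in self.cls:
            return 0
        k, local = self.cls[c]
        # Count class-k symbols in the prefix, then count c among those.
        return rank(self.sub[k], local, rank(self.K, k, i))

    def select(self, c, j):
        if c not in self.cls:
            return -1
        k, local = self.cls[c]
        p = select(self.sub[k], local, j)   # position inside the class subsequence
        if p < 0:
            return -1
        # The (p+1)-th class-k symbol in K is the j-th occurrence of c.
        return select(self.K, k, p + 1)
```

With `text = "to be or not to be that is the question".split()` and `ps = PartitionedString(text)`, the calls `ps.rank("be", 6)` and `ps.select("to", 2)` give the same answers as the naive `rank(text, "be", 6)` and `select(text, "to", 2)`. In a practical implementation each list would be replaced by a compressed sequence with fast $rank$ and $select$ support, which is where the space and time trade-offs discussed in the abstract come from.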