Given a text, a rank query returns the number of occurrences of a character up to a given position, and a select query returns the position of the occurrence of a character with a given rank. These queries have applications in, e.g., compression, computational geometry, and most notably pattern matching via backward search, the backbone of many compressed full-text indices. In practice, the wavelet tree is probably the most widely used data structure for rank and select queries on texts over non-binary alphabets. In this paper, we present techniques that speed up queries by a factor of two (access and select) to three (rank) compared to the wavelet tree implementation in the widely used Succinct Data Structure Library (SDSL). To this end, we change the underlying tree structure from a binary tree to a 4-ary tree, and we reduce cache misses by approximating rank queries with a predictive model in order to prefetch all data required for the actual rank query.
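To make the two query types concrete, the following is a minimal, deliberately naive sketch of their semantics. It is not the wavelet tree or any SDSL interface. The function names and the indexing conventions are assumptions chosen for illustration: `rank` counts occurrences in the half-open prefix `text[0:i]`, and `select` takes a 1-based occurrence number and returns a 0-based position.

```python
def rank(text: str, c: str, i: int) -> int:
    """Number of occurrences of c in text[0:i], i.e. positions 0 .. i-1.

    Convention (an assumption for this sketch): the prefix is half-open.
    """
    return text[:i].count(c)


def select(text: str, c: str, k: int) -> int:
    """0-based position of the k-th occurrence of c (k >= 1), or -1 if none.

    Convention (an assumption for this sketch): k is 1-based.
    """
    seen = 0
    for pos, ch in enumerate(text):
        if ch == c:
            seen += 1
            if seen == k:
                return pos
    return -1


text = "abracadabra"
print(rank(text, "a", 4))    # 2: "abra" contains two 'a's
print(select(text, "a", 3))  # 5: the third 'a' is at position 5
```

Both functions scan the text and therefore take linear time per query. Structures such as the wavelet tree answer the same queries without scanning, which is the setting the techniques in this paper address.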