A 'degenerate string' is a sequence of subsets of some alphabet; it represents any string obtainable by selecting one character from each set from left to right. Recently, Alanko et al. generalized the rank-select problem to degenerate strings, where given a character $c$ and position $i$ the goal is to find either the $i$th set containing $c$ or the number of occurrences of $c$ in the first $i$ sets [SEA 2023]. The problem has applications to pangenomics; in another work, Alanko et al. use it as the basis for a compact representation of de Bruijn graphs that supports fast membership queries. In this paper we revisit the rank-select problem on degenerate strings, introducing a new, natural parameter and reanalyzing existing reductions to rank-select on regular strings. By plugging in standard data structures, we improve the query time bounds exponentially while essentially matching, or improving, the space bounds. Furthermore, we provide a lower bound on space that shows that the reductions lead to succinct data structures in a wide range of cases. Finally, we provide implementations; our most compact structure matches the space of the most compact structure of Alanko et al. while answering queries twice as fast. We also provide an implementation using modern vector processing features; it uses less than one percent more space than the most compact structure of Alanko et al. while supporting queries four to seven times faster, and its query time is competitive with all the remaining structures.
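To make the two queries concrete, the following is a minimal Python sketch of rank and select on a degenerate string, written directly from the definitions above. It is a naive baseline for illustration only, not the reduction or the data structures studied in the paper; the class name `NaiveDegenerateRankSelect` and its method names are hypothetical, and positions are assumed to be 1-indexed.

```python
from bisect import bisect_right


class NaiveDegenerateRankSelect:
    """Naive baseline (illustrative, not the paper's structure).

    Stores, for each character, the sorted list of 1-indexed positions
    of the sets that contain it.
    """

    def __init__(self, sets):
        self.positions = {}
        for idx, s in enumerate(sets, start=1):
            # set(s) guards against repeated characters in one position.
            for ch in set(s):
                self.positions.setdefault(ch, []).append(idx)

    def rank(self, i, c):
        """Number of sets among the first i that contain c."""
        return bisect_right(self.positions.get(c, []), i)

    def select(self, i, c):
        """Position of the i-th set containing c, or None if there is none."""
        pos = self.positions.get(c, [])
        return pos[i - 1] if 1 <= i <= len(pos) else None


# A degenerate string of length 5 over {A, C, G, T}; the third set is empty.
X = [{"A", "C"}, {"T"}, set(), {"A"}, {"C", "G", "A"}]
ds = NaiveDegenerateRankSelect(X)
print(ds.rank(3, "A"))    # occurrences of A in the first 3 sets
print(ds.select(2, "A"))  # position of the 2nd set containing A
```

In this sketch, rank costs a binary search over one position list and select is a single array lookup, but the position lists are stored uncompressed, so it makes no attempt at the succinct space bounds discussed above.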