Space-Efficient Indexes for Uncertain Strings

Strings in the real world are often encoded with some level of uncertainty. In the character-level uncertainty model, an uncertain string $X$ of length $n$ over an alphabet $\Sigma$ is a sequence of $n$ probability distributions over $\Sigma$. Given an uncertain string $X$ and a weight threshold $\frac{1}{z}\in(0,1]$, we say that a pattern $P$ occurs in $X$ at position $i$ if the product of the probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ is at least $\frac{1}{z}$. While standard strings can be indexed for online pattern searches in linear time and space, indexing uncertain strings is much more challenging. Specifically, the state-of-the-art index for uncertain strings has $\mathcal{O}(nz)$ size, requires $\mathcal{O}(nz)$ time and $\mathcal{O}(nz)$ space to be constructed, and answers pattern matching queries in the optimal $\mathcal{O}(m+|\text{Occ}|)$ time, where $m$ is the length of $P$ and $|\text{Occ}|$ is the total number of occurrences of $P$ in $X$. For large $n$ and (moderate) $z$ values, this index is completely impractical to construct, which outweighs the benefit of the optimal pattern matching queries it supports. We were thus motivated to design a space-efficient index at the expense of slower yet competitive pattern matching queries. We propose an index of $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected size, which can be constructed using $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected space, and supports very fast pattern matching queries in expectation, for patterns of length $m\geq \ell$. We have implemented and evaluated several versions of our index. The best-performing version of our index is up to two orders of magnitude smaller than the state of the art in terms of both index size and construction space, while offering faster or very competitive query and construction times.
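To make the occurrence definition concrete, the following is a minimal Python sketch of a naive checker that scans every position of an uncertain string and tests the probability product against the threshold $\frac{1}{z}$. It is not the proposed index and not the state-of-the-art index; it only illustrates the definition. The function name `occurrences`, the list-of-dictionaries representation of $X$, the use of exact fractions, and the 0-based positions are all assumptions made for this illustration.

```python
from fractions import Fraction
from typing import Dict, List

# Assumed representation: X[i] maps each letter to its probability at position i.
UncertainString = List[Dict[str, Fraction]]


def occurrences(X: UncertainString, P: str, z) -> List[int]:
    """Return all 0-based positions i at which P occurs in X, i.e. where the
    product of the probabilities of P's letters at positions i..i+|P|-1
    is at least 1/z. Naive O(n*m) scan, for illustration only."""
    threshold = 1 / Fraction(z)
    hits = []
    m = len(P)
    for i in range(len(X) - m + 1):
        prob = Fraction(1)
        for j, letter in enumerate(P):
            # A letter missing from the distribution has probability 0.
            prob *= X[i + j].get(letter, Fraction(0))
            # Probabilities are at most 1, so the product never increases;
            # once it drops below the threshold, position i cannot match.
            if prob < threshold:
                break
        else:
            hits.append(i)
    return hits


# A small hypothetical uncertain string of length 4 over the alphabet {A, B}.
X = [
    {"A": Fraction(1, 2), "B": Fraction(1, 2)},
    {"A": Fraction(1)},
    {"A": Fraction(3, 4), "B": Fraction(1, 4)},
    {"A": Fraction(1, 2), "B": Fraction(1, 2)},
]

print(occurrences(X, "AA", 2))  # threshold 1/2
print(occurrences(X, "AA", 4))  # threshold 1/4
print(occurrences(X, "AB", 4))  # threshold 1/4
```

In this example, lowering the threshold from $\frac{1}{2}$ to $\frac{1}{4}$ admits one more occurrence of `AA`, because the product at the third position is $\frac{3}{4}\cdot\frac{1}{2}=\frac{3}{8}$. Exact fractions are used here only to avoid floating-point rounding at the threshold boundary.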
