The Positional Burrows--Wheeler Transform (PBWT) is a data structure for efficiently representing and querying large collections of sequences, such as haplotype panels in genomics. Forward and backward stepping operations, the analogues of LF- and FL-mapping in the traditional BWT, are fundamental to the PBWT and underpin many PBWT-based algorithms for haplotype matching and related analyses. Although the run-length encoded variant of the PBWT (also known as the $μ$-PBWT) achieves $O(\newR)$-word space usage, where $\newR$ is the total number of runs, no data structure supporting both forward and backward stepping in constant time within this space bound was previously known. In this paper, we consider the multi-allelic PBWT, which extends the original binary form to a general ordered alphabet $\{0, \dots, σ-1\}$. We first establish bounds on $\newR$ and then introduce a new $O(\newR)$-word data structure, built over a list of haplotypes $\{S_1, \dots, S_\height\}$, each of length $\width$, that supports constant-time forward and backward stepping. We then revisit two key applications, haplotype retrieval and prefix search, using our efficient forward stepping technique. Specifically, we design an $O(\newR)$-word data structure that supports haplotype retrieval in $O(\log \log_{\word} \height + \width)$ time. For prefix search, we present an $O(\height + \newR)$-word data structure that answers queries in $O(m' \log\log_{\word} σ + \occ)$ time, where $m'$ denotes the length of the longest common prefix returned and $\occ$ denotes the number of haplotypes prefixed by that longest prefix.
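To make the stepping operations concrete, the following is a minimal, uncompressed sketch of a multi-allelic PBWT in Python. It is an illustration only and not the $O(\newR)$-word structure described above: each step here scans a whole column and therefore takes linear rather than constant time. The function names (`build_pbwt`, `forward_step`, `backward_step`, `retrieve`) and the 0-based indexing are assumptions made for this example. It relies on the standard observation that the haplotype order at column $k+1$ is a stable sort of the order at column $k$ by the symbol in column $k$.

```python
def build_pbwt(haps):
    """Build uncompressed PBWT columns for equal-length sequences over {0..sigma-1}.

    Returns (columns, orders): orders[k] is the haplotype order before column k,
    and columns[k][i] is the symbol at column k of haplotype orders[k][i].
    """
    order = list(range(len(haps)))
    width = len(haps[0])
    columns, orders = [], []
    for k in range(width):
        orders.append(order)
        columns.append([haps[j][k] for j in order])
        # Python's sorted() is stable, so ties keep their current relative order.
        order = sorted(order, key=lambda j: haps[j][k])
    orders.append(order)
    return columns, orders


def forward_step(column, i):
    """Map row i of column k to the row of the same haplotype in column k+1."""
    c = column[i]
    smaller = sum(1 for x in column if x < c)    # symbols that sort before c
    rank = sum(1 for x in column[:i] if x == c)  # earlier occurrences of c
    return smaller + rank


def backward_step(column, i):
    """Inverse of forward_step: map row i of column k+1 back to a row of column k."""
    rows = sorted(range(len(column)), key=lambda p: column[p])
    return rows[i]


def retrieve(columns, i):
    """Recover the haplotype that sits at row i of column 0 by repeated forward steps."""
    out = []
    for column in columns:
        out.append(column[i])
        i = forward_step(column, i)
    return out


# Hypothetical toy panel: 4 haplotypes of length 3 over the alphabet {0, 1, 2}.
panel = [[0, 1, 2], [1, 0, 0], [0, 0, 1], [2, 1, 0]]
pbwt_columns, pbwt_orders = build_pbwt(panel)
```

Because the initial order is the identity, `retrieve(pbwt_columns, j)` walks haplotype `panel[j]` from left to right using one forward step per column. A run-length compressed structure would replace the linear scans in `forward_step` and `backward_step` with lookups over runs.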