The positional Burrows-Wheeler Transform (PBWT) is commonly used to store haplotype panels compactly so that, given a query haplotype, we can quickly find the set maximal exact matches (SMEMs) between the query and the haplotypes in the panel. This process generally has two steps: first, we find the maximal substrings of the query that occur at the same positions in haplotypes in the panel; then, for each such substring, we report the haplotypes in the panel in which the substring occurs at the same position as in the query. Very recently, Bonizzoni, Gagie and Gao (2026) gave two time-space tradeoffs for the second step: they use either $O((r + h) \log n)$ bits and $O(\log \log \min(h, \ell) + k)$ time to report $k$ haplotypes in the panel, or $O(r \log h + h \log n)$ bits and $O(k \log \log h)$ time, where $r$ is the number of runs in the panel's PBWT and $h$, $\ell$ and $n = h \ell$ are the panel's height, length and size, respectively. We observe here that if, for example, we can batch queries until we have found $r \lg(h) / \lg r$ such substrings, and we report an average of at least $\lg(r) / \lg h$ haplotypes in the panel per substring, then for the second step we can easily use $O(r \log h)$ bits and constant time to report each haplotype. Our approach is based on an algorithm for quickly constructing the prefix arrays from the PBWT, which may be of independent interest.
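To make the role of the prefix arrays in the second step concrete, the following is a minimal sketch in Python. It is not the batched, run-compressed construction described above: it builds every prefix array explicitly with the textbook stable-partition step, which takes $O(h \ell)$ words, and it locates the matching interval by a naive scan. The function names (`prefix_arrays`, `report_matches`) and the toy binary panel are assumptions made for illustration only. What the sketch does show is the property the second step relies on: the haplotypes that agree with the query on columns `start..end-1` occupy a contiguous interval of the prefix array for column `end`, so once that interval is known each haplotype is reported by a single array lookup.

```python
def prefix_arrays(panel):
    """Return [a_0, ..., a_l] for a binary panel (list of equal-length lists).

    a_k lists the haplotype indices sorted by their reversed prefixes
    panel[i][:k], with ties broken by the order in a_{k-1}.
    Illustrative only: stores all arrays explicitly, with no compression.
    """
    h = len(panel)
    length = len(panel[0]) if h else 0
    order = list(range(h))
    arrays = [order]
    for k in range(length):
        # Stable partition by the symbol in column k: zeros first, then ones.
        zeros = [i for i in order if panel[i][k] == 0]
        ones = [i for i in order if panel[i][k] == 1]
        order = zeros + ones
        arrays.append(order)
    return arrays


def report_matches(panel, arrays, query, start, end):
    """Report haplotypes equal to the query on columns start..end-1.

    The interval is found by a naive scan (an assumption of this sketch);
    reporting is then a plain slice of the prefix array for column `end`.
    """
    a = arrays[end]
    flags = [panel[i][start:end] == query[start:end] for i in a]
    if True not in flags:
        return []
    lo = flags.index(True)
    hi = len(flags) - flags[::-1].index(True)
    # Matching haplotypes are contiguous in a, so a[lo:hi] is exactly the answer.
    assert all(flags[lo:hi])
    return a[lo:hi]


# Hypothetical toy panel: 5 haplotypes, 4 sites.
panel = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
]
arrays = prefix_arrays(panel)
query = [1, 1, 0, 0]
print(report_matches(panel, arrays, query, 1, 3))  # [0, 2, 1]
print(report_matches(panel, arrays, query, 1, 4))  # [2, 1]
```

In this sketch the cost of reporting is dominated by the scan that finds the interval. A practical structure would instead find the interval during the first step and keep only compressed information about the prefix arrays, which is where the space bounds discussed above come in.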