The Burrows-Wheeler transform (BWT) is a reversible text transformation used extensively in text compression, indexing, and bioinformatics, particularly in short-read alignment. Constructing the BWT of a long string, however, poses significant challenges. We introduce a novel approach that partitions a long string into shorter substrings, so that multi-string BWT construction algorithms can process the input. The partitioning is based on a prefix of the suffix array, and we provide an implementation for DNA sequences. In a comparison with state-of-the-art BWT construction algorithms, we demonstrate a speed improvement of approximately 12% on a real genome dataset of 3.2 billion characters. The proposed partitioning strategy is applicable to strings over any alphabet.
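To make the reversibility of the BWT concrete, the following is a minimal sketch of the textbook definition on a single short string. It is not the partitioning method proposed here: it builds the suffix array by naively sorting all suffixes, which is only practical for small inputs. The function names `bwt` and `inverse_bwt` and the choice of `$` as an end-of-string sentinel are assumptions made for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the BWT of text, built from a naively sorted suffix array.

    Assumes the sentinel does not occur in text and sorts before every
    character of text (true for "$" and the DNA alphabet A, C, G, T).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input text")
    s = text + sentinel
    # Naive suffix array: sort start positions by the suffix they begin.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character for each suffix is the character preceding it;
    # for the suffix at position 0, s[-1] wraps around to the sentinel.
    return "".join(s[i - 1] for i in suffix_array)


def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Recover the original text from its BWT (without the sentinel)."""
    # A stable sort of the BWT positions by character links each row of
    # the sorted first column to the matching occurrence in the last column.
    order = sorted(range(len(last)), key=lambda i: last[i])
    row = order[0]  # row of the suffix that starts at text position 0
    out = []
    for _ in range(len(last) - 1):
        row = order[row]       # step one position forward in the text
        out.append(last[row])  # last[row] is the character just passed
    return "".join(out)


if __name__ == "__main__":
    transformed = bwt("GATTACA")
    print(transformed)
    print(inverse_bwt(transformed))
```

The naive sort compares whole suffixes and therefore scales poorly with input length, which is one reason dedicated construction algorithms are needed for genome-scale strings.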