We present a new semi-external algorithm that builds the Burrows--Wheeler transform variant of Bauer et al. (known as the BCR BWT) in linear expected time. Our method uses compression techniques to reduce computational costs when the input is massive and repetitive. Concretely, we build on induced suffix sorting (ISS) and use run-length and grammar compression to keep our intermediate results in compact form. This compressed representation not only saves space but also speeds up the required computations. Our experiments show substantial savings in space and computation time when the text is repetitive. On moderate-size collections of real human genome assemblies (14.2--75.05 GB), our memory peak is, on average, 1.7x smaller than that of the state-of-the-art BCR BWT construction algorithm (\texttt{ropebwt2}), while our method runs 5x faster. Our current implementation also computed the BCR BWT of 400 real human genome assemblies (1.2 TB) in 41.21 hours using 118.83 GB of working memory (around 10\% of the input size). Interestingly, the results we report for the 1.2 TB file are dominated by the difficulty of scanning huge files under memory constraints (specifically, by I/O operations). This suggests that a more careful implementation of our method could perform much better and thus scale efficiently to even larger inputs.
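To make the object being computed concrete, the following is a minimal sketch of the BCR BWT of a small string collection, followed by a run-length encoding of the result. It is a naive reference that sorts every suffix explicitly; it is not the semi-external, ISS-based method described above and would not scale to large inputs. The function names `bcr_bwt` and `run_length_encode` are hypothetical. The sketch assumes that each string receives its own implicit end marker, that end markers sort before all other symbols, and that ties between end markers are broken by the order of the strings in the collection.

```python
from itertools import groupby


def bcr_bwt(strings):
    """Naive BCR BWT of a string collection (illustrative only).

    Assumption: string i is terminated by an implicit marker $_i, with
    $_0 < $_1 < ... < every ordinary symbol. All markers print as '$'.
    """
    rows = []
    for i, s in enumerate(strings):
        # One suffix per position, plus the suffix made of the marker alone.
        for j in range(len(s) + 1):
            # (1, c) encodes an ordinary symbol, (0, i) the marker of string i,
            # so markers sort first and are ordered by string index.
            key = [(1, c) for c in s[j:]] + [(0, i)]
            # The symbol preceding the suffix; the whole string is preceded
            # by its own end marker.
            prev = s[j - 1] if j > 0 else "$"
            rows.append((key, prev))
    rows.sort(key=lambda row: row[0])
    return "".join(prev for _, prev in rows)


def run_length_encode(text):
    """Collapse each maximal run of equal symbols into (symbol, length)."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


if __name__ == "__main__":
    bwt = bcr_bwt(["ab", "aa"])
    print(bwt)
    print(run_length_encode(bwt))
```

On repetitive collections the transform tends to contain long runs of equal symbols, which is why a run-length representation can be much smaller than the plain text.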