The Burrows-Wheeler Transform (BWT) is a string transformation technique widely used in areas such as bioinformatics and file compression. Many applications combine run-length encoding (RLE) with the BWT in a way that preserves the ability to query the compressed data efficiently. However, these methods may not take full advantage of the compressibility of the BWT, because they do not modify the alphabet ordering used by the sorting step embedded in computing the BWT. Altering the alphabet ordering can have a considerable impact on the output of the BWT, in particular on the number of runs. For an alphabet $\Sigma$ containing $\sigma$ characters, the space of all alphabet orderings is of size $\sigma!$. While an exhaustive investigation is possible for small alphabets, finding the optimal ordering for larger alphabets is not feasible. There is therefore a need for a more informed search strategy than brute-force sampling of the entire space, which motivates a new heuristic approach. In this paper, we explore the non-trivial cases of the problem of minimizing the size of a run-length encoded BWT (RLBWT) by selecting a new ordering for the alphabet. We show that random sampling of the space of alphabet orderings usually gives sub-optimal orderings for compression, and that a local search strategy can provide a large improvement in relatively few steps. We also inspect a selection of initial alphabet orderings, including ASCII, letter appearance, and letter frequency. While this alphabet ordering problem is computationally hard, we demonstrate a gain in compressibility.
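To make the role of the alphabet ordering concrete, the following is a minimal sketch and not the implementation used in the paper. It computes the BWT by naive rotation sorting under a caller-supplied ordering, counts the runs of the result, and then tries random pairwise swaps of the ordering, keeping a swap only when it reduces the run count. The function names `bwt`, `count_runs`, and `local_search`, the `$` sentinel (assumed absent from the input and ranked below every letter), the swap neighbourhood, and the default step count are all assumptions made for illustration.

```python
import random


def bwt(text, order):
    """BWT of text + '$' under the alphabet ordering `order`.

    Assumption: '$' does not occur in `text` and sorts before every letter.
    Naive rotation sort, suitable only for short illustrative inputs.
    """
    rank = {c: i + 1 for i, c in enumerate(order)}
    rank["$"] = 0
    s = text + "$"
    # Sort rotation start positions by the rank sequence of each rotation.
    starts = sorted(range(len(s)),
                    key=lambda i: [rank[c] for c in s[i:] + s[:i]])
    # The last character of the rotation starting at i is s[i - 1].
    return "".join(s[i - 1] for i in starts)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def local_search(text, order, steps=200, seed=0):
    """Hill-climb over orderings using random pairwise swaps (assumed move set).

    Requires an alphabet of at least two characters.
    """
    rng = random.Random(seed)
    best = list(order)
    best_runs = count_runs(bwt(text, best))
    for _ in range(steps):
        i, j = rng.sample(range(len(best)), 2)
        cand = best[:]
        cand[i], cand[j] = cand[j], cand[i]
        runs = count_runs(bwt(text, cand))
        if runs < best_runs:  # accept strictly improving swaps only
            best, best_runs = cand, runs
    return "".join(best), best_runs


if __name__ == "__main__":
    text = "banana"
    ascii_order = "".join(sorted(set(text)))  # ASCII ordering as a start point
    print(bwt(text, ascii_order), count_runs(bwt(text, ascii_order)))
    print(local_search(text, ascii_order))
```

Starting `local_search` from a different initial ordering, for example letters in order of first appearance via `"".join(dict.fromkeys(text))`, is one way to compare starting points in this sketch. Each step recomputes the full transform, so the cost per step is dominated by the rotation sort.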