The Burrows-Wheeler transform (BWT) is a reversible transform that converts a string $w$ into another string $\mathsf{BWT}(w)$. The size of the run-length encoded BWT (RLBWT) can be interpreted as a measure of repetitiveness within the class of representations known as dictionary compression, which are essentially representations based on copy-and-paste operations. In this paper, we shed new light on the compressiveness of the BWT and the bijective BWT (BBWT). We first extend previous results on the relation between their run-length compressed sizes $r$ and $r_B$. We also show that the so-called ``clustering effect'' of the BWT and BBWT can be captured by measures other than empirical entropy or run-length encoding. In particular, we show that the BWT and BBWT do not increase the repetitiveness of a string, with respect to various measures based on dictionary compression, by more than a polylogarithmic factor. Furthermore, we show that there exists an infinite family of strings that are maximally incompressible under any dictionary compression measure, but become very compressible after applying the BBWT. An interesting implication of this result is that, in some cases, it is possible to transcend dictionary compression simply by applying the BBWT before applying dictionary compression.
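To make the quantities $r$ and $r_B$ concrete, the following is a minimal Python sketch, not the construction used in the paper. It assumes the BWT is taken as the last column of the sorted rotations of $w$ with no end-of-string sentinel, and that the BBWT is obtained by sorting all rotations of the Lyndon factors of $w$ in infinite-periodic ($\omega$) order. The function names `bwt`, `lyndon_factors`, `bbwt`, and `runs` are illustrative choices, and the quadratic-time sorting is for clarity only.

```python
def bwt(w: str) -> str:
    """BWT as the last column of the sorted rotations (assumes no sentinel)."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(rot[-1] for rot in rotations)


def lyndon_factors(w: str) -> list:
    """Lyndon factorization of w via Duval's algorithm."""
    factors = []
    i, n = 0, len(w)
    while i < n:
        j, k = i + 1, i
        while j < n and w[k] <= w[j]:
            # Restart the comparison on a strict increase, else advance.
            k = i if w[k] < w[j] else k + 1
            j += 1
        while i <= k:
            factors.append(w[i:i + j - k])
            i += j - k
    return factors


def bbwt(w: str) -> str:
    """Bijective BWT: last characters of the rotations of all Lyndon
    factors, sorted by omega-order (comparison of infinite repetitions)."""
    n = len(w)
    entries = []
    for f in lyndon_factors(w):
        for i in range(len(f)):
            rot = f[i:] + f[:i]
            # Comparing prefixes of length 2n of the infinite powers is
            # enough, since any two rotations have total length <= 2n.
            key = (rot * (2 * n // len(rot) + 1))[:2 * n]
            entries.append((key, rot[-1]))
    entries.sort()
    return "".join(last for _, last in entries)


def runs(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


if __name__ == "__main__":
    w = "banana"
    print(bwt(w), runs(bwt(w)))    # nnbaaa 3
    print(bbwt(w), runs(bbwt(w)))  # annbaa 4
```

For `banana`, the Lyndon factorization is `b`, `an`, `an`, `a`, so under these assumptions $r = 3$ and $r_B = 4$ on this one input; the example illustrates the definitions only and says nothing about the general relation between $r$ and $r_B$.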