The Burrows-Wheeler Transform (BWT) is a fundamental component in many data structures for text indexing and compression, widely used in areas such as bioinformatics and information retrieval. The extended BWT (eBWT) generalizes the classical BWT to multisets of strings, providing a flexible framework that captures many BWT-like constructions. Several known variants of the BWT can be viewed as instances of the eBWT applied to specific decompositions of a word. A central property of the BWT, essential for its compressibility, is the number of maximal ranges of equal letters, called runs. In this article, we explore how different decompositions of a word affect the number of runs in the resulting eBWT. First, we show that the number of decompositions of a word is exponential, even under minimal constraints on the size of the subsets in the decomposition. Second, we present an infinite family of words for which the ratio between the numbers of runs of the worst and best decompositions is unbounded, under the same minimal constraints. These results illustrate the potential cost of decomposition choices in eBWT-based compression and underline the challenges in optimizing run-length encoding in generalized BWT frameworks.
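To make the notions of eBWT and runs concrete, the following is a minimal Python sketch, not the construction used in the article. It assumes the usual definition of the eBWT (sort all rotations of all words in the multiset by comparing their infinite periodic repetitions, then concatenate the last letters), and it assumes that a decomposition is simply a list of words given by the caller. The function names `ebwt` and `count_runs` and the example word are illustrative choices, and the sort is a naive one intended only for small inputs.

```python
from itertools import groupby


def ebwt(words):
    """Naive eBWT of a multiset of non-empty words (illustrative sketch).

    All rotations of all words are sorted by omega-order, i.e. by comparing
    their infinite repetitions, and the last letters are concatenated.
    """
    max_len = max(len(w) for w in words)
    rotations = [w[i:] + w[:i] for w in words for i in range(len(w))]

    def omega_key(rot):
        # A prefix of length 2 * max_len of rot^omega is enough to separate
        # any two rotations whose infinite repetitions differ (Fine and Wilf).
        reps = (2 * max_len) // len(rot) + 1
        return (rot * reps)[: 2 * max_len]

    # Rotations with equal keys share the same last letter, so the order
    # among ties does not change the output.
    rotations.sort(key=omega_key)
    return "".join(rot[-1] for rot in rotations)


def count_runs(s):
    """Number of maximal ranges of equal letters in s."""
    return sum(1 for _ in groupby(s))


if __name__ == "__main__":
    # The classical BWT (without end marker) is the eBWT of a single word.
    whole = ebwt(["banana"])
    # One hypothetical decomposition of the same word into two pieces.
    split = ebwt(["ban", "ana"])
    print(whole, count_runs(whole))  # nnbaaa 3
    print(split, count_runs(split))  # nabnaa 5
```

On this small example, the undecomposed word yields 3 runs while the two-piece decomposition yields 5, which illustrates on a toy scale how the choice of decomposition can change the number of runs.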