Efficient terabyte-scale text compression via stable local consistency and parallel grammar processing

We present a highly parallelizable text compression algorithm that scales efficiently to terabyte-sized datasets. Our method builds on locally consistent grammars, a lightweight form of compression, combined with simple recompression techniques to achieve further space reductions. Locally consistent grammar algorithms are particularly suitable for scaling, as they need minimal satellite information to compact the text. To enable parallelisation, we introduce a novel concept: stable local consistency. A grammar algorithm ALG is stable if, for any pattern $P$ occurring in a collection $\mathcal{T}=\{T_1, T_2, \ldots, T_k\}$, the instances $ALG(T_1), ALG(T_2), \ldots, ALG(T_k)$ independently produce cores for $P$ with the same topology. In a locally consistent grammar, the core of $P$ is a subset of nodes and edges in $\mathcal{T}$'s parse tree that remains the same in all occurrences of $P$. This feature is important for achieving compression, but it only holds if ALG synchronises the parsing of the strings, for instance, by defining a common set of nonterminal symbols for them. Stability removes the need for synchronisation during the parsing phase. Consequently, we can run $ALG(T_1), ALG(T_2), \ldots, ALG(T_k)$ fully in parallel and then merge the resulting grammars into a single compressed output equivalent to $ALG(\mathcal{T})$. We implemented our ideas and tested them on massive datasets. In our experiments, the method processed a diverse collection of bacterial genomes (7.9 TB) in around nine hours, using 16 threads and 0.43 bits/symbol of working memory, and produced a compressed representation 85 times smaller than the original input.
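
The intuition behind stability can be illustrated with a toy sketch. It is not the algorithm described above, only a minimal single-round parser written under stated assumptions: phrase boundaries are placed at strict local minima of a fixed, content-only fingerprint, so no state is shared between strings. The names `fingerprint`, `break_positions`, `parse`, `merge`, and `compress_collection` are hypothetical and chosen for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def fingerprint(symbol: str) -> int:
    """Fixed fingerprint that depends only on the symbol's content."""
    digest = hashlib.blake2b(symbol.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def break_positions(text: str) -> list[int]:
    """Phrase start positions: 0, plus every strict local minimum of the fingerprints."""
    if not text:
        return []
    fps = [fingerprint(c) for c in text]
    starts = [0]
    for i in range(1, len(text) - 1):
        # The decision at i looks only at positions i-1, i, i+1 (local consistency).
        if fps[i] < fps[i - 1] and fps[i] < fps[i + 1]:
            starts.append(i)
    return starts


def parse(text: str) -> list[str]:
    """Split one string into phrases, independently of any other string."""
    starts = break_positions(text) + [len(text)]
    return [text[a:b] for a, b in zip(starts, starts[1:])]


def merge(*phrase_lists: list[str]) -> dict[str, int]:
    """Merge independently produced phrases into one dictionary with canonical ids."""
    distinct = sorted(set().union(*phrase_lists))
    return {phrase: idx for idx, phrase in enumerate(distinct)}


def compress_collection(texts: list[str], threads: int = 4) -> dict[str, int]:
    """Parse every string in parallel with no synchronisation, then merge once."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        parsed = list(pool.map(parse, texts))
    return merge(*parsed)
```

Because each boundary decision depends only on a symbol and its two neighbours, two occurrences of the same pattern in different strings receive the same interior boundaries, whichever thread parses them. Phrases are identified by content, so in this toy setting the merge reduces to a set union.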
