We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with Byte Pair Encoding (BPE), reorders the vocabulary so that frequent tokens receive small integer identifiers, and encodes the result with variable-length integers before passing it to any standard compressor. On enwik8 (100 MB of Wikipedia text), this improves compression ratio by 7.08 percentage points (pp) for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd (all including vocabulary overhead), outperforming the classical Word Replacing Transform. Gains are consistent at 1 GB scale (enwik9) and across Chinese and Arabic text. We further show that preprocessing accelerates compression for computationally expensive algorithms: because the preprocessed input is substantially smaller, the full pipeline, including preprocessing, runs 3.1x faster than raw zstd-22 and 2.4x faster than raw LZMA in total wall-clock time. The method can be implemented in under 50 lines of code.
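The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it substitutes whitespace splitting for BPE tokenization, uses LEB128-style variable-length integers so that the most frequent tokens (which receive the smallest IDs) encode in a single byte, and feeds the result to zlib as the standard compressor. All function names here are illustrative.

```python
import zlib
from collections import Counter

def encode_varint(n: int) -> bytes:
    """LEB128 variable-length encoding: IDs below 128 fit in one byte."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits + continuation flag
        n >>= 7
    out.append(n)
    return bytes(out)

def decode_varints(data: bytes) -> list[int]:
    """Decode a stream of LEB128 varints back into integer token IDs."""
    ids, n, shift = [], 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:          # continuation bit set: more bytes follow
            shift += 7
        else:
            ids.append(n)
            n, shift = 0, 0
    return ids

def preprocess(text: str) -> tuple[bytes, list[str]]:
    """Tokenize, give frequent tokens small IDs, and varint-encode."""
    tokens = text.split(" ")  # stand-in for BPE tokenization
    ranked = [t for t, _ in Counter(tokens).most_common()]
    vocab = {t: i for i, t in enumerate(ranked)}  # rank 0 = most frequent
    payload = b"".join(encode_varint(vocab[t]) for t in tokens)
    return payload, ranked

def restore(payload: bytes, ranked: list[str]) -> str:
    """Invert the preprocessing: IDs back to tokens (lossless)."""
    return " ".join(ranked[i] for i in decode_varints(payload))

text = "the cat sat on the mat the cat"
payload, ranked = preprocess(text)
compressed = zlib.compress(payload, 9)  # any standard compressor fits here
assert restore(payload, ranked) == text  # round trip is lossless
```

In a real deployment the ranked vocabulary (the `ranked` list above) must be shipped alongside the compressed payload; the paper's reported gains already include that overhead.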