We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with Byte Pair Encoding (BPE), reorders the vocabulary so that frequent tokens receive small integer identifiers, and encodes the result with variable-length integers before passing it to any standard compressor. On enwik8 (100 MB of Wikipedia text), this improves compression ratio by 7.08 percentage points (pp) for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd (all including vocabulary overhead), outperforming the classical Word Replacing Transform. Gains are consistent at 1 GB scale (enwik9) and across Chinese and Arabic text. We further show that preprocessing accelerates compression for computationally expensive algorithms: because the preprocessed input is substantially smaller, the full pipeline, including preprocessing, runs 3.1x faster than raw zstd-22 and 2.4x faster than raw LZMA in total wall-clock time. The method can be implemented in under 50 lines of code.
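The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it substitutes whitespace splitting for BPE tokenization, uses LEB128-style variable-length integers so that the most frequent tokens (which receive the smallest IDs) encode in a single byte, and feeds the result to zlib as the standard compressor. All function names here are illustrative.

```python
import zlib
from collections import Counter

def encode_varint(n: int) -> bytes:
    """LEB128 variable-length encoding: IDs below 128 fit in one byte."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits + continuation flag
        n >>= 7
    out.append(n)
    return bytes(out)

def decode_varints(data: bytes) -> list[int]:
    """Decode a stream of LEB128 varints back into integer token IDs."""
    ids, n, shift = [], 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:          # continuation bit set: more bytes follow
            shift += 7
        else:
            ids.append(n)
            n, shift = 0, 0
    return ids

def preprocess(text: str) -> tuple[bytes, list[str]]:
    """Tokenize, give frequent tokens small IDs, and varint-encode."""
    tokens = text.split(" ")  # stand-in for BPE tokenization
    ranked = [t for t, _ in Counter(tokens).most_common()]
    vocab = {t: i for i, t in enumerate(ranked)}  # rank 0 = most frequent
    payload = b"".join(encode_varint(vocab[t]) for t in tokens)
    return payload, ranked

def restore(payload: bytes, ranked: list[str]) -> str:
    """Invert the preprocessing: IDs back to tokens (lossless)."""
    return " ".join(ranked[i] for i in decode_varints(payload))

text = "the cat sat on the mat the cat"
payload, ranked = preprocess(text)
compressed = zlib.compress(payload, 9)  # any standard compressor fits here
assert restore(payload, ranked) == text  # round trip is lossless
```

In a real deployment the ranked vocabulary (the `ranked` list above) must be shipped alongside the compressed payload; the paper's reported gains already include that overhead.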