In this paper, we explore the idea of training large language models (LLMs) over highly compressed text. While standard subword tokenizers compress text by a small factor, neural text compressors can achieve much higher rates of compression. If it were possible to train LLMs directly over neurally compressed text, this would confer advantages in training and serving efficiency, as well as easier handling of long text spans. The main obstacle to this goal is that strong compression tends to produce opaque outputs that are not well-suited for learning. In particular, we find that text naïvely compressed via Arithmetic Coding is not readily learnable by LLMs. To overcome this, we propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length. Using this method, we demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks. While our method delivers worse perplexity than subword tokenizers for models trained with the same parameter count, it has the benefit of shorter sequence lengths. Shorter sequence lengths require fewer autoregressive generation steps, and reduce latency. Finally, we provide extensive analysis of the properties that contribute to learnability, and offer concrete suggestions for how to further improve the performance of high-compression tokenizers.
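The core idea of Equal-Info Windows can be illustrated with a small sketch. The paper pairs an Arithmetic Coding compressor with a learned model and emits windows of exactly equal bit length; here, as a stand-in, we use `zlib` as the compressor and a greedy variant that packs each window up to a fixed bit budget rather than padding to an exactly equal length. The function name and the `bit_budget` parameter are illustrative, not from the paper.

```python
import zlib


def equal_info_windows(text: str, bit_budget: int = 128) -> list[str]:
    """Greedily segment `text` into windows whose compressed size stays
    within `bit_budget` bits each.

    Sketch only: zlib stands in for the paper's AC-based neural
    compressor, and windows fit under the budget instead of being
    padded to exactly equal bit lengths.
    """
    windows = []
    start = 0
    while start < len(text):
        end = start + 1  # every window holds at least one character
        # Extend the window while the next character still fits the budget.
        while end < len(text):
            bits = len(zlib.compress(text[start:end + 1].encode())) * 8
            if bits > bit_budget:
                break
            end += 1
        windows.append(text[start:end])
        start = end
    return windows
```

Each window's boundary depends only on the text inside it, so a decoder can decompress windows independently; this locality, rather than the compression rate itself, is what the paper identifies as making the compressed stream learnable.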