In this paper, we present the novel insight that an off-the-shelf LLM can function as an excellent token compressor and decompressor. To demonstrate this, we design a self-expressive autoencoding framework that fine-tunes a pretrained LLM, using only lightweight LoRA-based adapter heads, to translate long texts into a compact internal language of discrete, variable-length latent codes, termed Z-tokens, and to reconstruct the original text exactly from them. The resulting representation is content-adaptive: semantically dense segments receive more Z-tokens, while redundant or predictable regions are aggressively compressed. Empirically, our method achieves up to an 18x token reduction on Wikipedia, CNN/DailyMail, HotpotQA, and Qulac-style long-query datasets, while preserving reconstruction fidelity and downstream performance. This simple yet effective design supports applications including prompt compression and autoregressive generation directly in the Z-token space, offering a potential pathway toward token-efficient long-context reasoning.
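Below is a minimal sketch of how such a self-expressive autoencoding setup could be wired up, assuming a Hugging Face causal LM with LoRA adapters from the `peft` library. The base model name, the Z-token codebook size `N_Z`, the `<z_i>` token format, and the prompt layout in `reconstruction_loss` are illustrative assumptions, not the paper's exact configuration; the encoding pass that produces the discrete Z-token ids (e.g. via straight-through quantization) is omitted for brevity.

```python
# Sketch: reserve a discrete Z-token vocabulary in an off-the-shelf LLM and
# train it (via LoRA) to reconstruct text conditioned on a Z-token prefix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-2-7b-hf"  # hypothetical base model choice
N_Z = 1024                         # assumed size of the Z-token codebook

tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Reserve N_Z new vocabulary entries as Z-tokens and resize the embeddings.
z_tokens = [f"<z_{i}>" for i in range(N_Z)]
tok.add_tokens(z_tokens, special_tokens=True)
model.resize_token_embeddings(len(tok))

# Lightweight LoRA adapters on the attention projections; the new Z-token
# embeddings and output head are kept trainable via modules_to_save, while
# the rest of the base model stays frozen.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
))

def reconstruction_loss(text: str, z_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy for reconstructing `text` from its Z-token prefix.

    `z_ids` (shape [1, k]) would come from the encoding pass that maps the
    text to discrete codes; here it is treated as given to keep the sketch
    short.
    """
    target = tok(text, return_tensors="pt").input_ids
    inputs = torch.cat([z_ids, target], dim=1)
    labels = inputs.clone()
    labels[:, : z_ids.shape[1]] = -100  # no loss on the Z-token prefix
    return model(input_ids=inputs, labels=labels).loss
```

One appeal of this prefix layout is that decompression needs no extra machinery: reconstructing the text is ordinary autoregressive generation conditioned on the Z-token prefix, and generating directly in Z-token space falls out of the same vocabulary extension.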