Learning-based probabilistic models can be combined with an entropy coder for data compression. However, because of their high complexity, their practical application as text compressors has been largely overlooked. To address this issue, our work focuses on a low-complexity design that maintains compression performance. We introduce a novel Learned Lossless Low-complexity Text Compression method (L3TC). First, we conduct extensive experiments demonstrating that RWKV models achieve the fastest decoding speed with a moderate compression ratio, making them the most suitable backbone for our method. Second, we propose an outlier-aware tokenizer that uses a limited vocabulary to cover frequent tokens while allowing outliers to bypass prediction and encoding. Third, we propose a novel high-rank reparameterization strategy that enhances learning capability during training without increasing complexity during inference. Experimental results validate that our method achieves a 48% bit saving compared to the gzip compressor. In addition, L3TC offers compression performance comparable to other learned compressors with a 50x reduction in model parameters. More importantly, L3TC is the fastest among all learned compressors, providing real-time decoding speeds of up to megabytes per second. Our code is available at https://github.com/alipay/L3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression.git.
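To make the pairing of a probabilistic model with an entropy coder concrete, the following minimal sketch estimates the compressed size of a token sequence. It is an illustration under stated assumptions, not the L3TC implementation: a static unigram model stands in for the RWKV backbone, the `<unk>` symbol, the function names, and the one-length-byte raw storage for outliers are all hypothetical, and the cost of transmitting the model itself is ignored. Each in-vocabulary token is charged its ideal code length of -log2 p(token) bits, which is the length an entropy coder approaches, while out-of-vocabulary tokens bypass the model and are stored raw.

```python
import math
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder symbol for out-of-vocabulary tokens


def build_vocab(tokens, size):
    """Keep the `size` most frequent tokens; everything else is an outlier."""
    return {tok for tok, _ in Counter(tokens).most_common(size)}


def estimate_bits(tokens, vocab):
    """Estimate compressed size in bits under a static unigram model.

    Returns (model_bits, bypass_bits). Tokens in `vocab` are charged their
    ideal code length, -log2 p(token). Outliers are seen by the model as
    UNK, and their raw UTF-8 bytes are counted separately as uncompressed.
    """
    mapped = [tok if tok in vocab else UNK for tok in tokens]
    counts = Counter(mapped)
    total = len(mapped)
    model_bits = sum(-math.log2(counts[sym] / total) for sym in mapped)
    # Assumption: each outlier costs one length byte plus its raw bytes.
    bypass_bits = sum(
        8 * (1 + len(tok.encode("utf-8"))) for tok in tokens if tok not in vocab
    )
    return model_bits, bypass_bits


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat on the log"
    tokens = text.split()
    vocab = build_vocab(tokens, size=4)
    model_bits, bypass_bits = estimate_bits(tokens, vocab)
    raw_bits = 8 * len(text.encode("utf-8"))
    print(f"raw: {raw_bits} bits")
    print(f"model-coded: {model_bits:.1f} bits, bypassed: {bypass_bits} bits")
```

Shrinking `size` moves more tokens to the bypass path. That makes the model's prediction task smaller but increases the raw bytes stored, which is the trade-off a limited-vocabulary tokenizer has to balance.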