We provide new estimates of an asymptotic upper bound on the entropy of English, using the large language model LLaMA-7B as a predictor of the next token given a window of past tokens. The resulting estimate is significantly smaller than the estimates currently available in \cite{cover1978convergent} and \cite{lutati2023focus}. A natural byproduct is an algorithm for lossless compression of English text that combines the predictions of the large language model with a lossless compression scheme. Preliminary results from limited experiments suggest that our scheme outperforms state-of-the-art text compression schemes such as BSC, ZPAQ, and paq8h.
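To illustrate how a next-token predictor yields an upper bound on entropy, here is a minimal sketch in Python. It assumes the bound is reported in bits per character and that an ideal entropy coder spends about $-\log_2 p$ bits on a token the model predicted with probability $p$. The function name `bits_per_char`, the tokenization, and the probabilities are hypothetical and chosen for illustration only; they do not come from LLaMA-7B or from the experiments summarized above.

```python
import math


def bits_per_char(token_probs, num_chars):
    """Estimate an entropy upper bound in bits per character.

    token_probs: probability the predictor assigned to each token that
                 actually occurred (hypothetical values in this sketch).
    num_chars:   number of characters covered by those tokens.
    """
    if num_chars <= 0:
        raise ValueError("num_chars must be positive")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    # Ideal code length: -log2(p) bits for a token predicted with probability p.
    total_bits = -sum(math.log2(p) for p in token_probs)
    return total_bits / num_chars


# Hypothetical example: three tokens covering an 11-character string.
text = "the cat sat"
tokens = ["the", " cat", " sat"]   # assumed tokenization
probs = [0.25, 0.5, 0.125]         # assumed model probabilities
estimate = bits_per_char(probs, len(text))
print(f"{estimate:.4f} bits/character")
```

The better the predictor, the closer each probability is to 1 and the fewer bits are needed, which is why a stronger language model can tighten the bound. Turning this estimate into an actual compressor requires feeding the same probabilities to an entropy coder; which coder is used is not specified here.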