Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Are n-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we show their value in both text analysis and improving neural LLMs. Yet this necessitates modernizing n-gram models in two respects. First, we train them at the same data scale as neural LLMs -- 1.4 trillion tokens. This is the largest n-gram model ever built. Second, existing n-gram models use small n, which hinders their performance; we instead allow n to be arbitrarily large by introducing a new $\infty$-gram LM with backoff. Instead of pre-computing n-gram count tables (which would be very expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute $\infty$-gram (as well as n-gram with arbitrary n) probabilities with millisecond-level latency. The $\infty$-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text: we find that the $\infty$-gram LM has fairly high accuracy for next-token prediction (47%), and that it can complement neural LLMs to greatly reduce their language modeling perplexities. When analyzing machine-generated text, we also observe irregularities in the machine--$\infty$-gram agreement level with respect to the suffix length, which indicate deficiencies in neural LLM pretraining and in the positional embeddings of Transformers. We open-source our infini-gram engine in the hope of enabling further study of how best to use verbatim information retrieved from large text corpora.
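
To make the idea of backoff over a suffix array concrete, the following is a minimal Python sketch on a toy corpus. It assumes an in-memory token list and a naive suffix-array construction, and all function names (`build_suffix_array`, `find_range`, `infgram_next_token_probs`) are hypothetical; it is not the infini-gram engine's implementation or API. It backs off to the longest suffix of the context that occurs in the corpus and estimates the next-token distribution from the tokens that follow those occurrences.

```python
from collections import Counter


def build_suffix_array(tokens):
    """Naive suffix array: start positions sorted by the suffix they begin.

    Quadratic-or-worse construction, acceptable only for a toy corpus.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_range(tokens, sa, pattern):
    """Return (lo, hi) so that sa[lo:hi] are all suffixes starting with pattern.

    Two binary searches over the suffix array; hi - lo is the pattern's count.
    """
    m = len(pattern)

    def prefix(k):
        return tokens[sa[k]:sa[k] + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-token prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-token prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def infgram_next_token_probs(tokens, sa, context):
    """Back off to the longest context suffix found in the corpus.

    Returns (suffix_used, {next_token: probability}).
    Simplification in this sketch: a suffix that only occurs at the very end
    of the corpus (so nothing follows it) is skipped in favor of a shorter one.
    """
    context = list(context)
    for start in range(len(context) + 1):
        suffix = context[start:]  # drop the oldest tokens first
        lo, hi = find_range(tokens, sa, suffix)
        followers = Counter(
            tokens[sa[k] + len(suffix)]
            for k in range(lo, hi)
            if sa[k] + len(suffix) < len(tokens)
        )
        if followers:
            total = sum(followers.values())
            return suffix, {tok: c / total for tok, c in followers.items()}
    return [], {}  # only reached for an empty corpus


corpus = "the cat sat on the mat the cat ate".split()
sa = build_suffix_array(corpus)
used, probs = infgram_next_token_probs(corpus, sa, ["dog", "saw", "the", "cat"])
print(used, probs)
```

In this toy run, the full context and "saw the cat" never occur in the corpus, so the sketch backs off to "the cat", which occurs twice and is followed once by "sat" and once by "ate". Note that the sketch enumerates every occurrence to count followers, which costs time proportional to the number of matches; it is meant only to illustrate the lookup, not to reflect how a trillion-token index would be served.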
