Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of a character string (e.g., an English sentence) is to first transform it into a sequence of tokens, which the model then scores. However, exponentially many token sequences can represent any given string. To truly compute the probability of a string, one should marginalize over all of its tokenizations, which is typically intractable. Here, we analyze whether the practice of ignoring this marginalization is justified. To this end, we devise an importance-sampling-based algorithm that lets us estimate the marginal probabilities, and we compare these estimates to the default procedure across a range of state-of-the-art models and datasets. Our results show that the gap in log-likelihood is no larger than 0.5% in most cases, but that it becomes more pronounced for data with long, complex words.
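To make the distinction between scoring one tokenization and marginalizing over all of them concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption for illustration and not the algorithm from this work: the vocabulary and probabilities in `TOKEN_PROBS` and `EOS_PROB` are hypothetical, the "model" scores tokens independently rather than autoregressively, greedy longest-match stands in for the default tokenizer, and the proposal simply picks uniformly among the vocabulary items that match the next characters. It also assumes every character is itself a token, so a left-to-right sample can always be completed.

```python
import math
import random

# Hypothetical toy "LM": independent token probabilities plus an end-of-sequence
# probability, so that the probabilities of all token sequences sum to 1.
TOKEN_PROBS = {"a": 0.2, "b": 0.2, "ab": 0.3, "ba": 0.1, "aba": 0.1}
EOS_PROB = 0.1


def seq_logprob(tokens):
    """Log-probability of one token sequence under the toy model."""
    return sum(math.log(TOKEN_PROBS[t]) for t in tokens) + math.log(EOS_PROB)


def next_pieces(rest):
    """Vocabulary items that are a prefix of the remaining string."""
    return [rest[:i] for i in range(1, len(rest) + 1) if rest[:i] in TOKEN_PROBS]


def all_tokenizations(s):
    """Enumerate every tokenization of s (exponentially many in general)."""
    if not s:
        yield []
        return
    for piece in next_pieces(s):
        for tail in all_tokenizations(s[len(piece):]):
            yield [piece] + tail


def greedy_tokenization(s):
    """Longest-match tokenization, standing in for the default tokenizer."""
    tokens = []
    while s:
        piece = max(next_pieces(s), key=len)
        tokens.append(piece)
        s = s[len(piece):]
    return tokens


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def exact_log_marginal(s):
    """Exact log p(s): sum over all tokenizations (only feasible for tiny s)."""
    return logsumexp([seq_logprob(t) for t in all_tokenizations(s)])


def sample_tokenization(s, rng):
    """Draw a tokenization from a uniform left-to-right proposal q.

    Returns the sampled tokens together with log q(tokens | s).
    """
    tokens, log_q = [], 0.0
    while s:
        options = next_pieces(s)
        piece = rng.choice(options)
        log_q -= math.log(len(options))
        tokens.append(piece)
        s = s[len(piece):]
    return tokens, log_q


def is_log_marginal(s, n_samples, seed=0):
    """Importance-sampling estimate of log p(s) = log E_q[p(T) / q(T | s)]."""
    rng = random.Random(seed)
    log_weights = []
    for _ in range(n_samples):
        tokens, log_q = sample_tokenization(s, rng)
        log_weights.append(seq_logprob(tokens) - log_q)
    return logsumexp(log_weights) - math.log(n_samples)
```

In this toy setup, `seq_logprob(greedy_tokenization(s))` plays the role of the default procedure, `exact_log_marginal(s)` is the quantity one would ideally compute, and `is_log_marginal(s, n_samples)` approximates it without enumerating tokenizations. The gap between the first two depends entirely on the made-up probabilities here and says nothing about the gaps measured on real models.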