The purpose of this article is twofold. First, we use the next-token probabilities given by a language model to explicitly define a $[0,1]$-enrichment of a category of texts in natural language, in the sense of Bradley, Terilla, and Vlassopoulos. We explicitly consider the termination conditions for text generation and determine when the enrichment itself can be interpreted as a probability over texts. Second, we compute the M\"obius function and the magnitude of an associated generalized metric space $\mathcal{M}$ of texts using a combinatorial version of these quantities recently introduced by Vigneaux. The magnitude function $f(t)$ of $\mathcal{M}$ is a sum over texts $x$ (prompts) of the Tsallis $t$-entropies of the next-token probability distributions $p(-|x)$, plus the cardinality of the model's possible outputs. The derivative of $f$ at $t=1$ recovers a sum of Shannon entropies, which justifies seeing magnitude as a partition function. Following Leinster and Shulman, we also express the magnitude function of $\mathcal{M}$ as an Euler characteristic of magnitude homology and provide an explicit description of the zeroth and first magnitude homology groups.
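The relation between the Tsallis and Shannon entropies mentioned above can be checked numerically. The following is a minimal sketch, not the construction of the article: it assumes the standard definition $S_t(p) = (1 - \sum_i p_i^t)/(t-1)$ and, as a labeled assumption, a toy magnitude function of the form $f(t) = (t-1)\sum_x S_t(p(-|x)) + c$. The abstract does not state this normalization; the prefactor $(t-1)$, the constant `n_outputs`, the helper names, and the two example distributions are all hypothetical and chosen only for illustration.

```python
import math

def tsallis_entropy(p, t):
    """Tsallis t-entropy S_t(p) = (1 - sum_i p_i^t) / (t - 1).

    At t = 1 the limit is the Shannon entropy in nats, returned directly.
    """
    support = [q for q in p if q > 0]
    if abs(t - 1.0) < 1e-12:
        return -sum(q * math.log(q) for q in support)
    return (1.0 - sum(q ** t for q in support)) / (t - 1.0)

def toy_magnitude(next_token_dists, n_outputs, t):
    # ASSUMPTION: the (t - 1) prefactor and the additive constant n_outputs
    # are illustrative; the abstract does not give the exact normalization.
    entropy_sum = sum(tsallis_entropy(p, t) for p in next_token_dists)
    return (t - 1.0) * entropy_sum + n_outputs

# Hypothetical next-token distributions p(-|x) for two prompts x.
dists = [[0.5, 0.5], [0.7, 0.2, 0.1]]
n_outputs = 3  # hypothetical count of possible outputs

# Central finite difference for f'(1).
h = 1e-6
deriv_at_1 = (toy_magnitude(dists, n_outputs, 1 + h)
              - toy_magnitude(dists, n_outputs, 1 - h)) / (2 * h)

# Sum of Shannon entropies (in nats) of the same distributions.
shannon_sum = sum(tsallis_entropy(p, 1.0) for p in dists)

print(deriv_at_1, shannon_sum)
```

Under these assumptions the two printed values agree up to finite-difference error, since $(t-1)S_t(p) = 1 - \sum_i p_i^t$ has derivative $-\sum_i p_i \ln p_i$ at $t=1$.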