Quantifying Emergence in Large Language Models

Emergence, broadly conceptualized as the ``intelligent'' behaviors of LLMs, has recently been studied but has proved challenging to quantify due to the lack of a measurable definition. It is most commonly estimated statistically from model performance across extensive datasets and tasks, which consumes significant resources. In addition, such estimates are difficult to interpret and may not accurately reflect the models' intrinsic emergence. In this work, we propose a quantifiable solution for estimating emergence. Inspired by emergentism in dynamics, we quantify the strength of emergence by comparing the entropy reduction at the macroscopic (semantic) level with that at the microscopic (token) level, both of which are derived from the representations within the transformer block. Using a low-cost estimator, our quantification method behaves consistently across a suite of LMs (GPT-2, GEMMA, etc.) on both in-context learning (ICL) prompts and natural sentences. Empirical results show that (1) our method gives consistent measurements that align with existing observations based on performance metrics, validating the effectiveness of our emergence quantification; (2) our proposed metric uncovers novel emergence patterns, such as a correlation between the variance of our metric and the number of ``shots'' in ICL, which in turn suggests a new way of interpreting hallucinations in LLMs; (3) our method offers a potential way to estimate the emergence of larger, closed-resource LMs via smaller LMs like GPT-2. Our code is available at: https://github.com/Zodiark-ch/Emergence-of-LLMs/.
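To make the comparison of entropy reductions concrete, the following is a minimal toy sketch, not the estimator used in this work. It assumes that the entropy reduction at each level can be written as H(Y) - H(Y|X) over a small discrete joint table, and that the emergence strength is the macroscopic reduction minus the mean microscopic reduction. The function names (`entropy_reduction`, `emergence_strength`) and the toy tables are hypothetical; in practice the quantities would be estimated from continuous transformer-block representations.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in bits of a probability vector (zero entries are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def entropy_reduction(joint):
    """Return H(Y) - H(Y|X) for a joint table joint[x, y] (counts or probabilities)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()          # normalize to a probability table
    h_x = entropy(joint.sum(axis=1))     # marginal entropy of X (rows)
    h_y = entropy(joint.sum(axis=0))     # marginal entropy of Y (columns)
    h_xy = entropy(joint.ravel())        # joint entropy H(X, Y)
    # H(Y|X) = H(X, Y) - H(X)
    return h_y - (h_xy - h_x)


def emergence_strength(macro_joint, micro_joints):
    """Assumed toy definition: macro entropy reduction minus mean micro reduction."""
    macro = entropy_reduction(macro_joint)
    micro = float(np.mean([entropy_reduction(j) for j in micro_joints]))
    return macro - micro


# Toy data (hypothetical): the macro level is fully predictable from its input,
# while each micro-level variable is independent of its input.
macro_joint = [[0.5, 0.0], [0.0, 0.5]]
micro_joints = [[[0.25, 0.25], [0.25, 0.25]]] * 3
strength = emergence_strength(macro_joint, micro_joints)
```

In this toy case the macroscopic level removes one full bit of uncertainty and the microscopic level removes none, so the sketch reports a positive emergence strength; equal reductions at both levels would give zero.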
