Transformer-based large language models (LLMs) comprise billions of parameters arranged in deep and wide computational graphs. Several studies on LLM efficiency optimization argue that a significant portion of these parameters can be pruned while only marginally impacting performance, suggesting that computation is not uniformly distributed across the parameters. We introduce a technique to systematically quantify computation density in LLMs. In particular, we design a density estimator drawing on mechanistic interpretability. We experimentally test our estimator and find that: (1) contrary to what is often assumed, LLM processing generally involves dense computation; (2) computation density is dynamic, in the sense that models shift between sparse and dense processing regimes depending on the input; (3) per-input density is significantly correlated across LLMs, suggesting that the same inputs trigger either low- or high-density processing regardless of the model. Investigating the factors that influence density, we observe that predicting rarer tokens requires higher density, while increasing context length often decreases density. We believe that our computation density estimator will contribute to a better understanding of the processing at work in LLMs, challenging their symbolic interpretation.