We observe an empirical phenomenon in Large Language Models (LLMs): a very small number of activations exhibit values vastly larger than the others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find that their values remain largely constant regardless of the input, and that they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities onto their corresponding tokens and, further, to implicit bias terms in the self-attention output. Last, we also study massive activations in Vision Transformers. Code is available at https://github.com/locuslab/massive-activations.
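To make the phenomenon concrete, here is a minimal sketch of how one might flag such outlier activations in a hidden-state tensor. The function name and the threshold rule (entries whose magnitude exceeds 100 times the median absolute activation) are illustrative assumptions, not the paper's exact criterion:

```python
import numpy as np

def find_massive_activations(hidden, ratio=100.0):
    """Return (token, feature) indices of outlier activations.

    hidden: array of shape (seq_len, hidden_dim)
    ratio:  how many times larger than the median |activation|
            an entry must be to count as "massive".
    Note: both the name and the ratio-based rule are hypothetical,
    chosen only to illustrate the scale gap described above.
    """
    mags = np.abs(hidden)
    threshold = ratio * np.median(mags)
    return np.argwhere(mags > threshold)

# Synthetic example: near-unit-scale noise plus one planted spike,
# mimicking the huge magnitude gap the abstract describes.
rng = np.random.default_rng(0)
hidden = rng.normal(scale=1.0, size=(16, 64))
hidden[0, 7] = 1e4  # planted massive activation

idx = find_massive_activations(hidden)
print(idx)  # expected to recover the planted location (0, 7)
```

On real models, `hidden` would be a layer's residual-stream output; the paper's finding that the large values sit at fixed feature dimensions and tokens is what makes such a simple magnitude test meaningful.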