Motivated in part by their relevance for low-precision training and quantization, massive activations in large language models (LLMs) have recently emerged as a topic of interest. However, existing analyses are limited in scope, and their generalizability across architectures is unclear. This paper helps address these gaps by analyzing massive activations across a broad range of LLMs, covering both GLU-based and non-GLU-based architectures. Our findings challenge several prior assumptions, most importantly: (1) not all massive activations are detrimental, i.e., suppressing them does not lead to an explosion of perplexity or a collapse in downstream task performance; (2) proposed mitigation strategies such as Attention KV bias are model-specific and ineffective in certain cases. We consequently investigate novel hybrid mitigation strategies; in particular, pairing Target Variance Rescaling (TVR) with Attention KV bias or Dynamic Tanh (DyT) successfully balances the mitigation of massive activations with preserved downstream model performance in the scenarios we investigated. Our code is available at: https://github.com/bluorion-com/refine_massive_activations.