Recent works have shown a surprising result: a small fraction of Large Language Model (LLM) parameter outliers are disproportionately important to the quality of the model. LLMs contain billions of parameters, so these small fractions, such as 0.01%, still translate to hundreds of thousands of parameters. In this work, we present an even more surprising finding: pruning as few as a single parameter can destroy an LLM's ability to generate text -- increasing perplexity by three orders of magnitude and reducing zero-shot accuracy to chance. We propose a data-free method for identifying such parameters, termed super weights, using a single forward pass through the model. We additionally find that these super weights induce correspondingly rare and large activation outliers, termed super activations. When preserved with high precision, super activations can improve simple round-to-nearest quantization to become competitive with state-of-the-art methods. For weight quantization, we similarly find that by preserving the super weight and clipping other weight outliers, round-to-nearest quantization can scale to much larger block sizes than previously considered. To facilitate further research into super weights, we provide an index of super weight coordinates for common, openly available LLMs.
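The outlier-aware round-to-nearest scheme described above can be illustrated with a minimal NumPy sketch. All names, coordinates, and magnitudes below are hypothetical, chosen only to show the mechanism: a single extreme weight inflates its quantization block's scale and crushes its neighbors, whereas clipping it before computing scales and restoring it in full precision afterwards keeps the rest of the block accurate.

```python
import numpy as np

def rtn_quantize(w, bits=4, block_size=128):
    """Round-to-nearest quantization with per-block absmax scales (dequantized)."""
    flat = w.reshape(-1, block_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    q = np.round(flat / scales)
    return (q * scales).reshape(w.shape)

# Hypothetical weight matrix with one planted "super weight" outlier.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)
w[42, 7] = 8.0  # artificial outlier; real super weight coordinates differ per model

# Naive RTN: the outlier dominates its block's absmax scale,
# so the other 127 weights in that block all round to zero.
naive = rtn_quantize(w)

# Outlier-aware RTN: clip the outlier before computing scales,
# then restore it at full precision after quantization.
clipped = w.copy()
clipped[42, 7] = np.clip(clipped[42, 7], -0.1, 0.1)
aware = rtn_quantize(clipped)
aware[42, 7] = w[42, 7]  # keep the super weight in high precision

# Compare reconstruction error on all weights except the outlier itself.
mask = np.ones_like(w, dtype=bool)
mask[42, 7] = False
err_naive = np.abs(naive - w)[mask].mean()
err_aware = np.abs(aware - w)[mask].mean()
print(err_naive > err_aware)  # outlier-aware RTN leaves less error on the rest
```

The only blocks that differ between the two runs are the ones touched by the planted outlier, which makes the error gap directly attributable to the scale inflation the abstract's clipping-plus-restore strategy avoids.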