The large number of parameters in pretrained language models enhances their performance, but it also makes them resource-intensive and therefore challenging to deploy on commodity hardware such as a single GPU. Because of the memory and power limitations of these devices, model compression techniques are often used to decrease both the model's size and its inference latency. This usually results in a trade-off between model accuracy and efficiency, so optimizing this balance is essential for effectively deploying large language models (LLMs) on commodity hardware. A significant source of the efficiency challenge is the feed-forward network (FFN) component, which accounts for roughly $\frac{2}{3}$ of the total parameters and inference latency. In this paper, we first observe that only a few neurons of the FFN module, which we call heavy hitters, have a large output norm for any input token, while the others are sparsely triggered by different tokens. Based on this observation, we explicitly split the FFN into two parts according to the heavy hitters. We improve the efficiency-accuracy trade-off of existing compression methods by allocating more resources to the FFN part that contains the heavy hitters. In practice, our method can reduce model size by 43.1\% and bring a $1.25\sim1.56\times$ wall-clock speedup on different hardware with a negligible accuracy drop.
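The idea of splitting the FFN by heavy hitters can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract does not state the selection criterion, so the sketch assumes a bias-free ReLU FFN and ranks neurons by their mean output norm over a set of calibration tokens. The function names (`ffn`, `neuron_scores`, `split_by_heavy_hitters`) and the `num_heavy` parameter are hypothetical.

```python
import numpy as np


def ffn(w_in, w_out, x):
    """Bias-free ReLU FFN (an assumption): x (n, d_model) -> (n, d_model)."""
    return np.maximum(x @ w_in, 0.0) @ w_out


def neuron_scores(w_in, w_out, tokens):
    """Mean output norm of each hidden neuron over the calibration tokens.

    Neuron j contributes hidden[t, j] * w_out[j] to the output for token t,
    so its output norm is hidden[t, j] * ||w_out[j]||_2 (hidden is >= 0).
    """
    hidden = np.maximum(tokens @ w_in, 0.0)    # (n_tokens, d_ff)
    row_norms = np.linalg.norm(w_out, axis=1)  # (d_ff,)
    return (hidden * row_norms).mean(axis=0)   # (d_ff,)


def split_by_heavy_hitters(w_in, w_out, tokens, num_heavy):
    """Split the FFN weights into a heavy-hitter part and the remainder.

    Assumed criterion: the num_heavy neurons with the largest mean output
    norm are treated as heavy hitters.
    """
    order = np.argsort(-neuron_scores(w_in, w_out, tokens))
    heavy = np.sort(order[:num_heavy])
    rest = np.sort(order[num_heavy:])
    heavy_part = (w_in[:, heavy], w_out[heavy, :])
    rest_part = (w_in[:, rest], w_out[rest, :])
    return heavy_part, rest_part, heavy
```

Because the FFN output is a sum of per-neuron contributions, evaluating the two parts separately and adding the results reproduces the original output exactly. This is what would allow each part to be compressed with a different budget, for example by keeping the heavy-hitter part at higher precision.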