The evolving capabilities of large language models are accompanied by growing sizes and deployment costs, necessitating effective inference optimisation techniques. We propose a novel pruning method that uses centrality measures from graph theory to reduce both the computational requirements and the memory footprint of these models. Specifically, we devise a method for representing multilayer perceptrons as weighted directed acyclic graphs, to which we apply a modified version of the weighted PageRank centrality measure to compute node importance scores. Combined with uniform pruning, this yields structured sparsity. We call this pruning method MLPRank. Furthermore, we introduce an extension to decoder-only transformer models, which we call LLMRank. Both variants perform strongly: MLPRank achieves on average 6.09% higher accuracy retention than three popular baselines, and LLMRank 13.42% higher than two popular baselines. Code is available at https://github.com/amazon-science/llm-rank-pruning.
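The scoring idea above can be sketched as follows. Because the MLP's graph is acyclic, a single forward pass of score propagation stands in for full power iteration: each source neuron distributes its score to successors in proportion to absolute edge weight, in the spirit of weighted PageRank. This is an illustrative sketch only, not the authors' exact MLPRank formulation; the function names and the uniform-sparsity masking helper are our own.

```python
import numpy as np

def neuron_importance(layer_weights):
    """Propagate weighted-PageRank-style importance through an MLP DAG.

    layer_weights: list of (out_dim, in_dim) weight matrices for
    consecutive linear layers. Returns one importance vector per layer
    output. Since the graph is acyclic, one forward propagation pass
    suffices instead of iterating to a fixed point.
    """
    scores = []
    r = np.ones(layer_weights[0].shape[1])   # uniform scores at the input nodes
    for W in layer_weights:
        A = np.abs(W)                        # edge weights |w_ij|, shape (out, in)
        out_totals = A.sum(axis=0)           # total outgoing weight of each source node
        out_totals[out_totals == 0] = 1.0    # guard against dead neurons
        r = A @ (r / out_totals)             # each source distributes its score proportionally
        scores.append(r)
    return scores

def structured_mask(score, sparsity):
    """Uniform pruning: keep the (1 - sparsity) fraction of top-scoring neurons."""
    k = int(round(len(score) * (1.0 - sparsity)))
    keep = np.argsort(score)[::-1][:k]
    mask = np.zeros(len(score), dtype=bool)
    mask[keep] = True
    return mask
```

Applying the same keep-fraction to every layer gives the structured sparsity described in the abstract: whole neurons (rows of one weight matrix and columns of the next) are removed, so the pruned model is genuinely smaller rather than merely containing scattered zeros.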