Large Language Models (LLMs) suffer from a huge number of parameters, which restricts their deployment on edge devices. Weight sharing is one promising solution that encourages weight reuse, effectively reducing memory usage with little performance drop. However, current weight sharing techniques primarily focus on small-scale models like BERT and employ coarse-grained sharing rules, e.g., layer-wise. This becomes limiting given the prevalence of LLMs: sharing an entire layer or block clearly diminishes the flexibility of weight sharing. In this paper, we present a perspective on head-wise shareable attention for large language models. We further propose two memory-efficient methods that share parameters across attention heads, with a specific focus on LLMs. Both use the same dynamic strategy to select the shared weight matrices. The first method directly reuses the pre-trained weights without retraining, denoted as $\textbf{DirectShare}$. The second method first post-trains with a constraint on weight matrix similarity and then shares, denoted as $\textbf{PostShare}$. Experimental results reveal that our head-wise shared models still maintain satisfactory capabilities, demonstrating the feasibility of fine-grained weight sharing applied to LLMs.
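The core idea of head-wise sharing can be sketched as follows. This is a minimal illustration, not the paper's actual selection strategy: it assumes per-head weight matrices are compared by cosine similarity of their flattened entries, and that the most similar pairs are greedily tied so one head reuses the other's matrix. The function name `share_most_similar_heads` and the similarity measure are illustrative assumptions.

```python
import numpy as np

def share_most_similar_heads(head_weights, num_share):
    """Greedily tie the most similar pairs of per-head weight matrices.

    head_weights: list of 2-D arrays, one weight matrix per attention head.
    num_share:    how many heads should reuse another head's matrix.
    Returns a new list in which each shared head points at its partner's
    array (illustrative sketch; cosine similarity is an assumption, not
    the paper's specified criterion).
    """
    # Flatten and L2-normalize each head's weights for cosine similarity.
    flat = np.stack([w.ravel() for w in head_weights])
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T
    np.fill_diagonal(sim, -np.inf)  # a head cannot share with itself

    shared = list(head_weights)
    for _ in range(num_share):
        # Pick the most similar remaining pair (i keeps, j reuses).
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        shared[j] = shared[i]
        # Head j is now tied; remove it from further consideration.
        sim[j, :] = -np.inf
        sim[:, j] = -np.inf
    return shared
```

In this sketch the selection is static; a dynamic strategy as described in the abstract would recompute or rank candidate matrices per model and budget rather than fixing the pairs in advance.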