In order to reduce the computational complexity of large language models, great efforts have been made to improve the efficiency of transformer models, such as linear attention and FlashAttention. However, model size and the corresponding computational cost are constantly scaled up in pursuit of higher performance. In this work, we present MemoryFormer, a novel transformer architecture that significantly reduces the computational complexity (FLOPs) from a new perspective. We eliminate nearly all the computation of the transformer model except for the computation required by the multi-head attention operation. This is made possible by an alternative method of feature transformation that replaces the linear projection of fully-connected layers. Specifically, we first construct a group of in-memory lookup tables that store a large number of discrete vectors to replace the weight matrix used in linear projection. We then use a hash algorithm to dynamically retrieve a correlated subset of these vectors based on the input embedding. The retrieved vectors are combined to form the output embedding, which provides an estimation of the result of the matrix multiplication in a fully-connected layer. Compared with matrix multiplication, retrieving data blocks from memory is a much cheaper operation that requires very little computation. We train MemoryFormer from scratch and conduct extensive experiments on various benchmarks to demonstrate the effectiveness of the proposed model.
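As a rough illustration of the idea (not the paper's exact algorithm), the sketch below replaces a linear projection y = Wx with hash-based table lookups: the input embedding is split into chunks, each chunk is hashed to a bucket index (here with a simple sign-bit hash, an illustrative assumption), the index retrieves a stored vector from that chunk's table, and the retrieved vectors are summed to estimate the projection output. All sizes and names (`num_tables`, `memory_layer`, etc.) are hypothetical choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 64, 64
num_tables = 8            # number of lookup tables (hypothetical choice)
chunk = d_in // num_tables
n_bits = chunk            # each chunk hashes to one of 2**n_bits buckets

# Each table stores 2**n_bits candidate output vectors, playing the role
# of the weight matrix in a fully-connected layer.
tables = [rng.normal(size=(2 ** n_bits, d_out)) * 0.1
          for _ in range(num_tables)]

def hash_chunk(x_chunk):
    # Toy hash: each coordinate's sign contributes one bit of the index.
    sign_bits = (x_chunk > 0).astype(np.int64)
    return int(sign_bits @ (2 ** np.arange(len(sign_bits))))

def memory_layer(x):
    # Estimate y = W @ x by summing retrieved vectors: no matmul needed.
    out = np.zeros(d_out)
    for t in range(num_tables):
        idx = hash_chunk(x[t * chunk:(t + 1) * chunk])
        out += tables[t][idx]
    return out

x = rng.normal(size=d_in)
y = memory_layer(x)
print(y.shape)  # (64,)
```

The key cost trade-off shown here: the per-token work is `num_tables` hash computations and memory reads plus one vector sum, instead of a full `d_in × d_out` multiply-accumulate.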