Inference on large language models can be expensive in both compute and memory, especially at long sequence lengths. The self-attention mechanism contributes significantly to these costs, which has motivated several recent works that propose sparse attention approximations for inference. In this work, we propose to approximate the self-attention computation by focusing on the dimensionality of the key vectors computed in the attention block. Our analysis reveals that the key vectors lie in a significantly lower-dimensional space, consistently across several datasets and models. Exploiting this observation, we propose Loki, a novel sparse attention method that ranks and selects tokens in the KV-cache based on attention scores computed in this low-dimensional space. Our evaluations show that Loki maintains the efficacy of the models better than other popular approximation methods, while speeding up the attention computation through reduced data movement (load/store) and compute costs.
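To make the ranking-and-selection idea concrete, the following is a minimal single-query, single-head sketch in NumPy. It is not the authors' implementation: the abstract does not say how the low-dimensional space is obtained, so the use of PCA on the keys is an assumption here, and the function names (`pca_basis`, `low_dim_topk_attention`, `full_attention`) and the parameters `d` (number of retained dimensions) and `k` (number of selected tokens) are hypothetical.

```python
import numpy as np

def pca_basis(keys):
    """Orthonormal basis for the key space, columns sorted by descending variance.

    ASSUMPTION: PCA is used to find the low-dimensional space; the abstract
    only states that keys lie in a lower-dimensional space.
    """
    cov = np.cov(keys, rowvar=False)        # (head_dim, head_dim)
    _, vecs = np.linalg.eigh(cov)           # eigenvalues come back ascending
    return vecs[:, ::-1]                    # reorder to descending variance

def low_dim_topk_attention(query, keys, values, basis, d, k):
    """Rank cached tokens with d-dimensional scores, then attend to the top k."""
    scale = 1.0 / np.sqrt(keys.shape[1])
    # Approximate scores from the first d components only.
    approx = (keys @ basis[:, :d]) @ (query @ basis[:, :d]) * scale
    # Indices of the k highest approximate scores (unordered).
    top = np.argpartition(approx, -k)[-k:]
    # Exact softmax attention restricted to the selected tokens.
    exact = keys[top] @ query * scale
    weights = np.exp(exact - exact.max())
    weights /= weights.sum()
    return weights @ values[top], top

def full_attention(query, keys, values):
    """Dense reference attention over every cached token."""
    scores = keys @ query / np.sqrt(keys.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

# Illustrative usage on random data (hypothetical sizes).
rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 16))
values = rng.normal(size=(64, 16))
query = rng.normal(size=16)
basis = pca_basis(keys)
output, selected = low_dim_topk_attention(query, keys, values, basis, d=4, k=8)
```

Because the basis is orthonormal, setting `d` to the full head dimension and `k` to the number of cached tokens recovers dense attention exactly; smaller values trade accuracy for less work. This sketch projects the keys at query time for clarity, so it does not itself reduce data movement. A practical implementation would presumably keep the keys in the rotated basis inside the KV-cache so that only the first `d` components are read during ranking, which is again an assumption rather than something the abstract states.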