The pursuit of reducing the memory footprint of the self-attention mechanism in multi-head self-attention (MHA) has spawned a rich portfolio of methods, e.g., grouped-query attention (GQA) and multi-head latent attention (MLA). These methods leverage specialized low-rank factorizations across embedding dimensions or attention heads. From the point of view of classical low-rank approximation, these methods are unconventional and raise the questions of which objects they actually approximate and how to interpret the low-rank behavior of the resulting representations. To answer these questions, this work proposes a generalized view of the weight objects in the self-attention layer and a factorization strategy, which together allow us to construct a parameter-efficient scheme called Tucker Attention. In LLM and ViT test cases, Tucker Attention requires an order of magnitude fewer parameters than GQA and MLA for comparable validation metrics. Additionally, Tucker Attention encompasses GQA, MLA, and MHA as special cases and is fully compatible with FlashAttention and rotary position embeddings (RoPE). This generalization yields insights into the actual ranks achieved by MHA, GQA, and MLA, and further enables simplifications of MLA.
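To make the notion of a low-rank factorization across attention heads concrete, the following is a minimal sketch and not the construction used in this work. It assumes that the per-head key projection matrices are stacked into a third-order tensor of shape (heads, embedding dimension, head dimension) and inspects the rank of the unfolding along the head mode. All names and sizes (`group_keys`, `head_mode_rank`, `d_model`, and so on) are hypothetical and chosen only for illustration. Because GQA shares one key projection among all heads of a group, the head-mode rank of the stacked tensor cannot exceed the number of groups, whereas independently drawn MHA weights generically have full head-mode rank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the paper).
d_model, d_head = 64, 16
n_heads, n_groups = 8, 2

# GQA: one key projection per group, shared by every head in that group.
group_keys = rng.standard_normal((n_groups, d_model, d_head))
head_to_group = np.repeat(np.arange(n_groups), n_heads // n_groups)
gqa_keys = group_keys[head_to_group]  # shape (n_heads, d_model, d_head)

# MHA: an independent key projection for every head.
mha_keys = rng.standard_normal((n_heads, d_model, d_head))


def head_mode_rank(weights: np.ndarray) -> int:
    """Rank of the unfolding along the head mode.

    Each head's projection matrix is flattened into one row, so the
    result counts the linearly independent heads.
    """
    unfolded = weights.reshape(weights.shape[0], -1)
    return int(np.linalg.matrix_rank(unfolded))


gqa_rank = head_mode_rank(gqa_keys)  # bounded by n_groups
mha_rank = head_mode_rank(mha_keys)  # generically n_heads
print(gqa_rank, mha_rank)
```

The sketch covers only the head mode of the key weights. A Tucker-style factorization would additionally compress the remaining modes, which is not shown here.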