We propose Cog Attention, a novel attention mechanism that allows attention weights to be negative for enhanced expressiveness. Its benefits stem from two key factors: (1) Cog Attention shifts the token deletion and copying function from a static OV matrix to dynamic QK inner products, freeing the OV matrix to focus more on refinement or modification. A single attention head can simultaneously delete, copy, or retain tokens by assigning them negative, positive, or minimal attention weights, respectively, making each head more flexible and expressive. (2) Cog Attention improves the model's robustness against representational collapse, which can occur when earlier tokens are over-squashed into later positions, producing homogeneous representations. Negative weights reduce the number of effective information paths from earlier to later tokens, helping to mitigate this issue. We develop Transformer-like models that use Cog Attention as their attention modules, including decoder-only models for language modeling and U-ViT diffusion models for image generation. Experiments show that models using Cog Attention outperform those using traditional softmax attention modules. Our approach suggests a promising research direction for rethinking and breaking the entrenched constraints of traditional softmax attention, such as the requirement for non-negative weights.
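To make the idea of signed attention weights concrete, here is a minimal NumPy sketch of one way to obtain them: softmax the *magnitudes* of the QK scores, then restore each score's sign. This is an illustrative assumption, not necessarily the paper's exact formulation; the function name `signed_attention` and the sign-times-softmax-of-magnitudes rule are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def signed_attention(q, k, v):
    """Toy signed attention (illustrative, not the paper's exact rule):
    softmax the magnitudes of the QK scores, then restore each score's
    sign, so a weight can be negative (delete), positive (copy), or
    near zero (retain)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # raw QK inner products
    weights = np.sign(scores) * softmax(np.abs(scores))
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out, w = signed_attention(q, k, v)
```

With Gaussian inputs, some QK scores are negative, so `w` contains negative weights, while the weight magnitudes in each row still sum to one; a standard softmax would force all of `w` to be non-negative.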