We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness. This expressiveness stems from two key factors: (1) Cog Attention shifts the token-deletion and copying functions from a static OV matrix to dynamic QK inner products, with the OV matrix now focusing more on refinement or modification. An attention head can simultaneously delete, copy, or retain tokens by assigning them negative, positive, or minimal attention weights, respectively, making a single head more flexible and expressive. (2) Cog Attention improves the model's robustness against representational collapse, which can occur when earlier tokens are over-squashed into later positions, leading to homogeneous representations. Negative weights reduce the number of effective information paths from earlier to later tokens, helping to mitigate this issue. We develop Transformer-like models that use Cog Attention as their attention modules, including decoder-only models for language modeling and U-ViT diffusion models for image generation. Experiments show that models using Cog Attention outperform those employing traditional softmax attention. Our approach suggests a promising direction for rethinking and breaking the entrenched constraints of traditional softmax attention, such as the requirement for non-negative weights.
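One plausible way to realize signed attention weights, sketched below purely for illustration, is to normalize the magnitudes of the QK scores and reattach their signs afterward; this is an assumption about the general idea, not necessarily the exact parameterization used in Cog Attention. Tokens whose scores are negative then contribute with a negative weight (deletion), positive scores contribute positively (copying), and near-zero scores are effectively retained untouched.

```python
import numpy as np

def signed_softmax_attention(Q, K, V):
    """Illustrative signed attention: softmax over score magnitudes,
    with the original sign of each QK score reattached.
    NOTE: a hypothetical sketch, not the paper's exact formulation."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # raw QK inner products
    signs = np.sign(scores)                # which tokens to delete (-) or copy (+)
    mags = np.exp(np.abs(scores))          # softmax over magnitudes only
    weights = mags / mags.sum(axis=-1, keepdims=True)
    signed_weights = signs * weights       # attention weights may now be negative
    return signed_weights @ V
```

With this sketch, a query aligned with one key and anti-aligned with another receives weights of equal magnitude but opposite sign, so the second token's value vector is subtracted from the output rather than merely down-weighted.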