We generalize the attention mechanism by viewing it through the lens of Entropic Optimal Transport (EOT), which reveals that standard attention corresponds to an entropically regularized transport problem with an implicit uniform prior. We introduce Generalized Optimal transport Attention with Trainable priors (GOAT), a new attention mechanism that replaces this uniform-prior assumption with a learnable, continuous prior while remaining fully compatible with optimized kernels such as FlashAttention. GOAT also provides an EOT-based explanation of attention sinks and a concrete remedy for them, avoiding the representational trade-offs of standard attention. Finally, by absorbing spatial information into the core attention computation, GOAT learns an extrapolatable prior that combines the flexibility of learned positional embeddings with the length generalization of fixed encodings.
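As a minimal worked sketch of this correspondence (our notation; the paper's exact formulation may differ): for a single query $q_i$ with cost $C_{ij} = -\,q_i^\top k_j/\sqrt{d}$, temperature $\varepsilon$, and a reference (prior) distribution $\pi_i$ over the keys, the row-constrained entropic transport problem has a closed-form solution,

$$
P_i \;=\; \arg\min_{P_i \in \Delta} \;\langle P_i, C_i\rangle \;+\; \varepsilon\, \mathrm{KL}\!\left(P_i \,\middle\|\, \pi_i\right)
\qquad\Longrightarrow\qquad
P_{ij} \;\propto\; \pi_{ij}\, \exp\!\big(q_i^\top k_j / (\varepsilon\sqrt{d})\big),
$$

i.e. $P_{ij} = \operatorname{softmax}_j\!\big(q_i^\top k_j/(\varepsilon\sqrt{d}) + \log \pi_{ij}\big)$. With a uniform $\pi_i$ and $\varepsilon = 1$ this recovers standard softmax attention; a learnable $\pi$ enters only as an additive log-prior term on the logits, which is presumably how the mechanism stays compatible with fused attention kernels and how a position-dependent, continuous prior can absorb spatial information into the attention computation itself.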