Dot-product self-attention (DPSA) is a fundamental component of transformers. However, scaling it to long sequences, such as documents or high-resolution images, becomes prohibitively expensive due to the quadratic time and memory complexity arising from the softmax operation. Kernel methods have been employed to simplify these computations by approximating softmax, but they often incur performance drops compared to softmax attention. We propose SeTformer, a novel transformer in which DPSA is entirely replaced by Self-optimal Transport (SeT), achieving better performance and computational efficiency. SeT builds on two essential properties of softmax: maintaining a non-negative attention matrix and using a nonlinear reweighting mechanism to emphasize important tokens in the input sequence. By introducing a kernel cost function for optimal transport, SeTformer effectively satisfies both properties. In particular, with small and base-sized models, SeTformer achieves top-1 accuracies of 84.7% and 86.2% on ImageNet-1K. In object detection, SeTformer-base outperforms its FocalNet counterpart by +2.2 mAP while using 38% fewer parameters and 29% fewer FLOPs. In semantic segmentation, our base-sized model surpasses NAT by +3.5 mIoU with 33% fewer parameters. SeTformer also achieves state-of-the-art results in language modeling on the GLUE benchmark. These findings highlight SeTformer's applicability to both vision and language tasks.
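To make the idea of replacing softmax attention with an optimal-transport plan concrete, the following is a minimal NumPy sketch of Sinkhorn-normalized attention with an exponentiated dot-product kernel as the affinity. It illustrates the two properties named above (a non-negative attention matrix and a nonlinear reweighting of tokens); the function name and the hyperparameters n_iters and eps are illustrative assumptions, not the paper's exact SeT formulation.

    # Illustrative sketch (assumed, not the authors' exact SeT method):
    # attention weights derived from an entropy-regularized optimal-transport
    # plan between queries and keys, using exp(q.k / eps) as the affinity kernel.
    import numpy as np

    def sinkhorn_attention(q, k, v, n_iters=5, eps=1.0):
        """q, k, v: (seq_len, dim) arrays; returns (seq_len, dim) outputs."""
        # Non-negative affinity matrix, playing the role of exp(-cost / eps).
        logits = q @ k.T / np.sqrt(q.shape[-1])
        K = np.exp(logits / eps)
        # Sinkhorn iterations: alternate row/column scalings toward unit marginals.
        u = np.ones(K.shape[0])
        w = np.ones(K.shape[1])
        for _ in range(n_iters):
            u = 1.0 / (K @ w)
            w = 1.0 / (K.T @ u)
        # Doubly-scaled transport plan; row-normalize so each token's weights sum to 1.
        plan = np.diag(u) @ K @ np.diag(w)
        plan = plan / plan.sum(axis=1, keepdims=True)
        return plan @ v

    # Toy usage: 8 tokens with 16-dimensional features.
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
    out = sinkhorn_attention(q, k, v)
    print(out.shape)  # (8, 16)

Unlike plain softmax, which normalizes each row independently, the Sinkhorn scaling also reweights columns, so tokens that dominate many rows are damped; this is one simple way an optimal-transport plan can emphasize informative tokens.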