Revisiting Transformers with Insights from Image Filtering and Boosting

The self-attention mechanism, a cornerstone of state-of-the-art Transformer-based deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation that explains its remarkable success and its limitations has therefore become an increasingly prominent focus of recent research. Some notable directions have sought to understand self-attention through the lens of image denoising and nonparametric regression. While promising, existing frameworks still lack a deeper mechanistic interpretation of the various architectural components that enhance self-attention, both in its original formulation and in subsequent variants. In this work, we aim to advance this understanding by developing a unifying image processing framework capable of explaining not only the self-attention computation itself but also the role of components such as positional encoding and residual connections, including numerous later variants. Building on our framework, we also pinpoint potential distinctions between the two concepts and make an effort to close this gap. To that end, we introduce two independent architectural modifications within Transformers. While our primary objective is interpretability, we empirically observe that image-processing-inspired modifications can also lead to notably improved accuracy, greater robustness against data contamination and adversaries across language and vision tasks, and better long-sequence understanding.
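The nonparametric-regression view mentioned above can be illustrated with a small sketch. It is not the framework developed in this work. It only checks the well-known identity that softmax attention is a normalized, kernel-weighted average of the values (a Nadaraya-Watson estimator with an exponential kernel), which is also the form taken by data-dependent image filters. The function names are illustrative, and the setup is an assumption made for brevity: a single head, no learned projections, and queries, keys, and values all equal to the input `X`.

```python
import numpy as np


def softmax_attention(Q, K, V):
    """Scaled dot-product attention (single head, no masking)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Subtract the row-wise max for numerical stability; softmax is unchanged.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V


def kernel_regression(Q, K, V):
    """Nadaraya-Watson estimator with kernel k(q, k) = exp(q . k / sqrt(d))."""
    d = Q.shape[-1]
    kernel = np.exp(Q @ K.T / np.sqrt(d))
    # Kernel-weighted sum of values, divided by the total kernel mass per query.
    return (kernel @ V) / kernel.sum(axis=-1, keepdims=True)


# Assumption: a toy input with 6 tokens of dimension 4, and Q = K = V = X.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

out_attn = softmax_attention(X, X, X)
out_nw = kernel_regression(X, X, X)

# The two computations agree up to floating-point error.
print(np.allclose(out_attn, out_nw))
```

Because each row of attention weights is non-negative and sums to one, every output token is a convex combination of the value vectors, which is what makes the comparison with image filters natural.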

