The remarkable success of transformers in sequence modeling tasks, spanning various applications in natural language processing and computer vision, is attributed to the critical role of self-attention. Like most deep learning models, these attention mechanisms are constructed from heuristics and experience. In our work, we derive self-attention from kernel principal component analysis (kernel PCA) and show that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space. We then formulate the exact formula for the value matrix in self-attention, demonstrating both theoretically and empirically that this value matrix captures the eigenvectors of the Gram matrix of the key vectors in self-attention. Leveraging our kernel PCA framework, we propose Attention with Robust Principal Components (RPC-Attention), a novel class of robust attention that is resilient to data contamination. We empirically demonstrate the advantages of RPC-Attention over softmax attention on the ImageNet-1K object classification, WikiText-103 language modeling, and ADE20K image segmentation tasks.
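For reference, the softmax attention operation that the kernel PCA framework reinterprets, in which each query attends over the keys and outputs a weighted combination of value vectors, can be sketched as follows. This is a minimal NumPy illustration of the standard baseline, not the paper's RPC-Attention; the weight names `W_q`, `W_k`, `W_v` are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Standard softmax self-attention over a sequence X of shape (n, d).

    Queries, keys, and values are linear projections of the input;
    each output row is an attention-weighted average of the value rows.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # (n, n) attention matrix; each row is a probability distribution.
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return A @ V

# Toy example: a sequence of 4 tokens with dimension 8.
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
```

Under the paper's kernel PCA view, the output rows above are interpreted as projections of the query vectors onto principal component axes of the key matrix in a feature space.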