$p$-Laplacian regularization, rooted in graph and image signal processing, introduces a parameter $p$ that controls how strongly, and in what way, such data are regularized. Smaller values of $p$ promote sparsity and interpretability, while larger values encourage smoother solutions. In this paper, we first show that the self-attention mechanism minimizes Laplacian regularization (the case $p=2$) and therefore encourages smoothness in the architecture. However, this smoothness is poorly suited to the heterophilic structure of self-attention in transformers, where attention weights are assigned almost indistinguishably to tokens in close proximity and to distant ones. Building on this insight, we propose a novel class of transformers, the $p$-Laplacian Transformer (p-LaT), which leverages the $p$-Laplacian regularization framework to harness the heterophilic features within self-attention layers. In particular, low values of $p$ assign higher attention weights to tokens in close proximity to the token currently being processed. We empirically demonstrate the advantages of p-LaT over baseline transformers on a wide range of benchmark datasets.
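To make the role of $p$ concrete, the following is a minimal sketch of one common form of the graph $p$-Laplacian (or $p$-Dirichlet) energy, $\frac{1}{p}\sum_{i,j} w_{ij}\,\lVert x_i - x_j\rVert^p$. It is an illustration under stated assumptions, not the functional or the attention rule used in p-LaT: the function name `p_dirichlet_energy`, the weight matrix `W`, and the toy features `X` are all hypothetical.

```python
import numpy as np


def p_dirichlet_energy(X, W, p=2.0):
    """Return (1/p) * sum_ij W[i, j] * ||X[i] - X[j]||^p.

    X: (n, d) array of token/node features (hypothetical toy input).
    W: (n, n) array of non-negative pairwise weights (assumed given).
    p: exponent of the regularizer, assumed p > 0.
    """
    diffs = X[:, None, :] - X[None, :, :]      # pairwise differences, shape (n, n, d)
    dists = np.linalg.norm(diffs, axis=-1)     # pairwise Euclidean distances, shape (n, n)
    return float((W * dists ** p).sum() / p)


# Toy signal: two nearby points and one distant point, fully connected graph.
X = np.array([[0.0], [0.1], [2.0]])
W = np.ones((3, 3)) - np.eye(3)

energy_p2 = p_dirichlet_energy(X, W, p=2.0)    # quadratic penalty (standard Laplacian case)
energy_p1 = p_dirichlet_energy(X, W, p=1.0)    # total-variation-like penalty
```

With $p=2$ the penalty is quadratic, so the two large differences account for nearly all of the energy and a minimizer is pushed toward uniformly smooth signals. With $p$ closer to 1 the penalty grows only linearly in each difference, so a few large jumps are tolerated more readily, which is the usual intuition behind the sparsity-promoting behavior of small $p$.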