When training transformers on graph-structured data, incorporating information about the underlying topology is crucial for good performance. Topological masking, a type of relative position encoding, achieves this by upweighting or downweighting attention depending on the relationship between query and key tokens in a graph. In this paper, we propose to parameterise topological masks as a learnable function of a weighted adjacency matrix -- a novel, flexible approach which incorporates a strong structural inductive bias. By approximating this mask with graph random features (for which we prove the first known concentration bounds), we show how this can be made fully compatible with linear attention, preserving $\mathcal{O}(N)$ time and space complexity with respect to the number of input tokens. The fastest previous alternative was $\mathcal{O}(N \log N)$ and only suitable for specific graphs. Our efficient masking algorithms provide strong performance gains for tasks on image and point cloud data, including on inputs with $>30$k nodes.
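To see why a low-rank mask preserves linear complexity: if the learned mask $\mathbf{M} = f(\mathbf{W})$ admits an approximation $\mathbf{M} \approx \boldsymbol{\Phi}\boldsymbol{\Phi}^\top$ with $\boldsymbol{\Phi} \in \mathbb{R}^{N \times m}$ (e.g. from graph random features), then masked linear attention can be computed by summing over the $m$ feature columns without ever materialising the $N \times N$ mask. A minimal NumPy sketch under this assumption -- the feature map and variable names are illustrative, not the paper's implementation:

```python
import numpy as np

def feature_map(x):
    # Positive query/key feature map; a simple ReLU-based choice common
    # in linear attention (illustrative assumption, not the paper's kernel).
    return np.maximum(x, 0.0) + 1e-6

def masked_linear_attention(Q, K, V, Phi):
    """Linear attention with a low-rank topological mask M ~= Phi @ Phi.T.

    Q, K: (N, d) queries/keys; V: (N, dv) values; Phi: (N, m) mask features.
    Attention weight: A[i, j] = (q_i . k_j) * M[i, j] with M = Phi @ Phi.T.
    Cost is O(N * m * d * dv) -- linear in N; the N x N matrices A and M
    are never formed explicitly.
    """
    Qf, Kf = feature_map(Q), feature_map(K)        # (N, d)
    # S[r] = sum_j Phi[j, r] * outer(Kf[j], V[j])  -- one summary per column
    S = np.einsum('jr,jd,jv->rdv', Phi, Kf, V)     # (m, d, dv)
    z = np.einsum('jr,jd->rd', Phi, Kf)            # (m, d) normaliser terms
    num = np.einsum('ir,id,rdv->iv', Phi, Qf, S)   # (N, dv)
    den = np.einsum('ir,id,rd->i', Phi, Qf, z)     # (N,)
    return num / den[:, None]
```

Because the mask factors through $\boldsymbol{\Phi}$, the per-column summaries `S` and `z` play the same role as the prefix sums of ordinary (unmasked) linear attention, which is what keeps the overall cost $\mathcal{O}(N)$ in the number of tokens.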