We present an approach to modifying Transformer architectures by integrating graph-aware relational reasoning into the attention mechanism, merging concepts from graph neural networks and language modeling. Building on the inherent connection between attention and graph theory, we reformulate the Transformer's attention mechanism as a graph operation and propose Graph-Aware Isomorphic Attention. This method leverages advanced graph modeling strategies, including Graph Isomorphism Networks (GIN) and Principal Neighborhood Aggregation (PNA), to enrich the representation of relational structures. Our approach captures complex dependencies and generalizes across tasks, as evidenced by a reduced generalization gap and improved learning performance. Additionally, we expand the concept of graph-aware attention to introduce Sparse GIN-Attention, a fine-tuning approach that employs sparse GINs. By interpreting attention matrices as sparse adjacency graphs, this technique enhances the adaptability of pre-trained foundational models with minimal computational overhead, endowing them with graph-aware capabilities. Sparse GIN-Attention fine-tuning achieves improved training dynamics and better generalization compared to alternative methods like low-rank adaptation (LoRA). We discuss latent graph-like structures within traditional attention mechanisms, offering a new lens through which Transformers can be understood. By evolving Transformers into hierarchical GIN models for relational reasoning, this perspective suggests profound implications for foundational model development, enabling the design of architectures that dynamically adapt to both local and global dependencies. Applications in bioinformatics, materials science, language modeling, and beyond could benefit from this synthesis of relational and sequential data modeling, setting the stage for interpretable and generalizable modeling strategies.
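To make the core idea of Sparse GIN-Attention concrete, the following is a minimal sketch (not the authors' implementation): it treats a head-averaged attention matrix as a sparse adjacency matrix and refines token states with a GIN-style update, h' = h + MLP((1 + eps) * h + A_sparse h). The class, parameter, and variable names (SparseGINAttentionAdapter, threshold, adj) are illustrative assumptions, not identifiers from the paper.

```python
# Hedged sketch of Sparse GIN-Attention fine-tuning: interpret attention
# weights as a sparse adjacency graph and apply a GIN update on hidden states.
import torch
import torch.nn as nn

class SparseGINAttentionAdapter(nn.Module):
    """Hypothetical adapter module (names are assumptions, not the paper's API)."""

    def __init__(self, d_model: int, threshold: float = 0.05):
        super().__init__()
        self.threshold = threshold               # prune weak attention edges
        self.eps = nn.Parameter(torch.zeros(1))  # learnable GIN epsilon
        self.mlp = nn.Sequential(                # injective GIN aggregator
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, hidden: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); attn: (batch, heads, seq, seq)
        adj = attn.mean(dim=1)                                        # average over heads
        adj = torch.where(adj > self.threshold, adj, torch.zeros_like(adj))  # sparsify
        neighbors = torch.bmm(adj, hidden)                            # aggregate neighbor states
        return hidden + self.mlp((1 + self.eps) * hidden + neighbors) # residual GIN update

# Usage sketch with toy tensors standing in for a frozen pre-trained layer's outputs.
adapter = SparseGINAttentionAdapter(d_model=64)
h = torch.randn(2, 10, 64)                             # hidden states
a = torch.softmax(torch.randn(2, 8, 10, 10), dim=-1)   # attention weights
out = adapter(h, a)                                    # (2, 10, 64)
```

As a design note, such an adapter adds only a small MLP and a scalar epsilon per layer, which is consistent with the abstract's claim of minimal computational overhead relative to full fine-tuning.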