We argue that Transformers are essentially graph-to-graph models, with sequences being just a special case. Attention weights are functionally equivalent to graph edges. Our Graph-to-Graph Transformer architecture makes this graph-modelling ability explicit by inputting graph edges into the attention weight computations and predicting graph edges with attention-like functions, thereby integrating explicit graphs into the latent graphs learned by pretrained Transformers. Adding iterative graph refinement provides a joint embedding of input, output, and latent graphs, allowing non-autoregressive graph prediction to optimise the complete graph without any bespoke pipeline or decoding strategy. Empirical results show that this architecture achieves state-of-the-art accuracies for modelling a variety of linguistic structures, integrating very effectively with the latent linguistic representations learned by pretraining.
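To make the mechanism concrete, the following is a minimal NumPy sketch of the three ingredients named above: attention scores conditioned on input edge labels, an attention-like edge predictor, and iterative refinement. It is an illustration under stated assumptions, not the architecture's actual implementation. In particular, adding an edge-label embedding to the key side of the score, the bilinear form of the edge scorer, and all names (`graph_attention`, `predict_edges`, `refine`, `edge_emb`, `U`) are hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def graph_attention(X, edges, Wq, Wk, Wv, edge_emb):
    """Single-head self-attention whose scores depend on input edge labels.

    X: (n, d) node embeddings.
    edges: (n, n) integer edge labels, with 0 meaning "no edge".
    edge_emb: (num_labels, d_k) label embeddings (assumed form of edge input).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    R = edge_emb[edges]  # (n, n, d_k): one embedding per node pair
    # Standard q.k term plus a q.r_ij term contributed by the input graph.
    scores = (Q @ K.T + np.einsum("id,ijd->ij", Q, R)) / np.sqrt(Q.shape[-1])
    A = softmax(scores, axis=-1)  # latent graph: soft edges between nodes
    return A @ V, A


def predict_edges(H, Wsrc, Wtgt, U):
    """Attention-like edge scorer: a bilinear score per (i, j, label)."""
    S, T = H @ Wsrc, H @ Wtgt
    logits = np.einsum("id,ldk,jk->ijl", S, U, T)  # (n, n, num_labels)
    return logits.argmax(axis=-1), logits


def refine(X, p, steps=3):
    """Non-autoregressive refinement: re-encode with the last predicted graph."""
    n = X.shape[0]
    edges = np.zeros((n, n), dtype=int)  # start from the empty graph
    for _ in range(steps):
        H, _ = graph_attention(X, edges, p["Wq"], p["Wk"], p["Wv"], p["edge_emb"])
        edges, _ = predict_edges(H, p["Wsrc"], p["Wtgt"], p["U"])
    return edges
```

Every edge of the output graph is predicted in parallel at each step, and the whole predicted graph is fed back as input to the next step, which is what lets refinement condition on the complete graph rather than on a left-to-right prefix. With all-zero edge embeddings, `graph_attention` reduces to ordinary scaled dot-product attention, matching the view of sequences as a special case.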