Transductive tasks on graphs differ fundamentally from typical supervised machine learning tasks: the independent and identically distributed (i.i.d.) assumption does not hold across samples. Instead, all training, validation, and test samples are available during training, making the setting closer to semi-supervised learning. These differences render the analysis of such models substantially different from that of standard supervised models. Recently, Graph Transformers have significantly improved results on these datasets by overcoming long-range dependency problems. However, the quadratic complexity of full Transformers has driven the community to explore more efficient variants, such as those with sparser attention patterns. While the attention matrix has been studied extensively, the hidden dimension, i.e. the width of the network, has received far less attention. In this work, we establish theoretical bounds on how, and under what conditions, the hidden dimension of these networks can be compressed. Our results apply to both sparse and dense variants of Graph Transformers.
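As a minimal illustration of what "compressing the hidden dimension" can mean, the sketch below factorizes one weight matrix of a feed-forward layer via truncated SVD so that the intermediate width shrinks. This is only a hedged example of the general idea; the specific dimensions, the rank, and the SVD-based construction are assumptions for the sketch, not the method or bounds established in the paper.

```python
import numpy as np

# Hypothetical example: shrink the intermediate width of a single
# linear map W1 (d_model -> d_hidden) by a rank-r factorization
# W1 ~= A @ B, so activations pass through width `rank` instead
# of width d_hidden. All sizes here are arbitrary choices.
rng = np.random.default_rng(0)
d_model, d_hidden, rank = 64, 256, 32

W1 = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)

# Truncated SVD: keep the top `rank` singular directions.
U, s, Vt = np.linalg.svd(W1, full_matrices=False)
A = U[:, :rank] * s[:rank]   # shape (d_model, rank)
B = Vt[:rank, :]             # shape (rank, d_hidden)

# Compare the full map with the width-compressed one on random inputs.
x = rng.standard_normal((8, d_model))
full = x @ W1                # goes through width d_hidden
compressed = (x @ A) @ B     # goes through width `rank` first
rel_err = np.linalg.norm(full - compressed) / np.linalg.norm(full)
print(f"relative error at rank {rank}: {rel_err:.3f}")
```

The approximation error depends on how quickly the singular values of the weight matrix decay; the paper's contribution concerns when such compression is possible without sacrificing the model's behavior, which this sketch does not attempt to show.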