Efficient Graph Knowledge Distillation from GNNs to Kolmogorov-Arnold Networks via Self-Attention Dynamic Sampling

The recent success of graph neural networks (GNNs) in modeling complex graph-structured data has fueled interest in deploying them on resource-constrained edge devices. However, their substantial computational and memory demands remain an obstacle. Knowledge distillation (KD) from GNNs to MLPs offers a lightweight alternative, but MLPs are limited by fixed activation functions and the absence of neighborhood aggregation, which constrains distilled performance. To tackle these intertwined limitations, we propose SA-DSD, a novel self-attention-guided dynamic sampling distillation framework. To the best of our knowledge, this is the first work to employ an enhanced Kolmogorov-Arnold Network (KAN) as the student model. We improve the Fourier KAN (FR-KAN+) with learnable frequency bases, phase shifts, and optimized algorithms, substantially improving nonlinear fitting capability over MLPs while preserving low computational complexity. To explicitly compensate for the absence of neighborhood aggregation, which is inherent to both MLP and KAN-based students, SA-DSD uses a self-attention mechanism to dynamically identify influential nodes, construct adaptive sampling probability matrices, and enforce teacher-student prediction consistency. Extensive experiments on six real-world datasets demonstrate that, under inductive and most transductive settings, SA-DSD surpasses three GNN teachers by 3.05% to 3.62% and improves FR-KAN+ by 15.61%. Moreover, it achieves a 16.69× parameter reduction and a 55.75% decrease in average runtime per epoch compared to key benchmarks.
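
To make the two ingredients named above more concrete, the following is a minimal illustrative sketch in plain Python, not the authors' implementation. The abstract gives no formulas, so everything here is an assumption: the layer form (a sum of sinusoids with one frequency and one phase per input and harmonic), the use of raw node features as both queries and keys, and the hypothetical function names `fourier_kan_layer`, `attention_sampling_probs`, and `sample_nodes`.

```python
import math
import random


def fourier_kan_layer(x, coeffs, freqs, phases, bias):
    """Apply one Fourier-style KAN layer to a single input vector.

    Assumed parameterization (not taken from the paper):
        y[j] = bias[j] + sum_i sum_k coeffs[j][i][k] * sin(freqs[i][k] * x[i] + phases[i][k])
    In training, coeffs, freqs, phases, and bias would all be learnable.
    """
    out = []
    for j, b in enumerate(bias):
        total = b
        for i, xi in enumerate(x):
            for k, (w, p) in enumerate(zip(freqs[i], phases[i])):
                total += coeffs[j][i][k] * math.sin(w * xi + p)
        out.append(total)
    return out


def attention_sampling_probs(features):
    """Row-wise softmax over scaled dot-product scores between node features.

    Row i is a probability distribution over all nodes. Queries and keys are
    the raw features here; learned projections are omitted for brevity.
    """
    d = len(features[0])
    probs = []
    for q in features:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in features]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        probs.append([e / z for e in exps])
    return probs


def sample_nodes(prob_row, num_samples, rng):
    """Draw node indices (with replacement) according to one probability row."""
    return rng.choices(range(len(prob_row)), weights=prob_row, k=num_samples)


# Toy usage: 3 nodes with 2 features each, 1 output unit, 2 harmonics per input.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
freqs = [[1.0, 2.0], [1.0, 2.0]]
phases = [[0.0, 0.0], [0.0, 0.0]]
coeffs = [[[1.0, 0.0], [0.0, 0.0]]]
bias = [0.0]

outputs = [fourier_kan_layer(x, coeffs, freqs, phases, bias) for x in features]
P = attention_sampling_probs(features)
neighbors_of_node_0 = sample_nodes(P[0], 2, random.Random(0))
```

In a distillation loop, the sampled indices could select the nodes on which student outputs are compared with teacher predictions. How SA-DSD actually builds its probability matrices and consistency loss is not specified in the abstract, so that step is left out of the sketch.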
