Training graph neural networks (GNNs) is extremely time-consuming because sparse graph-based operations are hard to accelerate in hardware. Prior work trades computational precision for lower time complexity via sampling-based approximation. Building on this idea, previous works have successfully accelerated dense-matrix operations (e.g., convolution and linear layers) with a negligible accuracy drop. However, unlike dense matrices, sparse matrices are stored in an irregular data format in which each row or column may have a different number of non-zero entries. Thus, compared with its dense counterpart, approximating sparse operations poses two unique challenges: (1) we cannot directly control the efficiency of an approximated sparse operation, since computation is executed only on non-zero entries; (2) sub-sampling sparse matrices is much less efficient because of the irregular data format. To address these issues, our key idea is to control the accuracy-efficiency trade-off by optimizing the allocation of computation resources across layers and across epochs. Specifically, for the first challenge, we customize the computation resources assigned to different sparse operations while keeping the total resource usage below a given budget. For the second challenge, we cache previously sampled sparse matrices to reduce the epoch-wise sampling overhead. Finally, we propose a switching mechanism to improve the generalization of GNNs trained with approximated operations. Together, these components form Randomized Sparse Computation (RSC), which demonstrates for the first time the potential of training GNNs with approximated operations. In practice, RSC can achieve up to an $11.6\times$ speedup for a single sparse operation and a $1.6\times$ end-to-end wall-clock speedup with a negligible accuracy drop.
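To make the notion of sampling-based approximation concrete, the following is a minimal sketch of a generic column-row sampling estimator for a sparse-dense product $AH$. It is not taken from the paper and should not be read as the RSC implementation: the function name `approx_spmm`, the COO input layout, and the norm-proportional sampling distribution are all assumptions made for illustration. The sketch also returns the number of non-zeros actually touched, which shows why the sample count `k` does not directly determine the cost of the approximated sparse operation.

```python
import numpy as np

def approx_spmm(rows, cols, vals, shape, H, k, rng):
    """Estimate A @ H from k sampled column-row pairs (illustrative only).

    A is given in COO form (rows, cols, vals) with the given shape.
    Column i of A is paired with row i of H and drawn with probability
    p_i proportional to ||A[:, i]|| * ||H[i, :]|| (an assumed choice).
    Returns the estimate and the number of non-zeros of A that were used.
    """
    n_rows, n_cols = shape
    out = np.zeros((n_rows, H.shape[1]))

    # Per-column norms of A computed directly from its non-zero entries.
    col_norms = np.sqrt(np.bincount(cols, weights=vals ** 2, minlength=n_cols))
    scores = col_norms * np.linalg.norm(H, axis=1)
    total = scores.sum()
    if total == 0:
        return out, 0  # A @ H is exactly zero in this case
    p = scores / total

    # Draw k pairs with replacement and count how often each was drawn.
    idx = rng.choice(n_cols, size=k, replace=True, p=p)
    counts = np.bincount(idx, minlength=n_cols)

    # Weight count_i / (k * p_i) makes the estimator unbiased.
    w = np.divide(counts, k * p, out=np.zeros(n_cols), where=p > 0)

    # Only non-zeros in sampled columns contribute; their number depends on
    # the sparsity pattern, not just on k.
    mask = w[cols] > 0
    r, c, v = rows[mask], cols[mask], vals[mask]
    np.add.at(out, r, (v * w[c])[:, None] * H[c])
    return out, int(mask.sum())
```

Under these assumptions, two calls with the same `k` can touch very different numbers of non-zeros depending on which columns are drawn, which is the first challenge described in the abstract.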