Graph neural networks (GNNs) are an important class of neural network models that have gained popularity in domains such as social and financial network analysis. The different phases of GNN computation can be modeled using both dense and sparse matrix operations. Many frameworks and optimization techniques have been proposed in the literature to accelerate GNNs. However, achieving consistently high performance across input graphs with different sparsity patterns and across GNN embedding sizes remains difficult. In this paper, we propose different algebraic reassociations of GNN computations that lead to novel selections and compositions of dense and sparse matrix primitives. We show that the profitability of these compositions depends on the input graph, the embedding size, and the target hardware. We developed SENSEi, a system that uses a data-driven adaptive strategy to select the best composition for a given input graph and GNN embedding sizes. Our evaluations on a wide range of graphs and embedding sizes show that, relative to the widely used Deep Graph Library, SENSEi achieves geomean speedups of $1.105\times$ (up to $2.959\times$) on CPUs and $1.187\times$ (up to $1.99\times$) on GPUs for graph convolutional networks, and geomean speedups of $2.307\times$ (up to $35.866\times$) on CPUs and $1.44\times$ (up to $5.69\times$) on GPUs for graph attention networks. Further, by evaluating against a well-tuned baseline, we show that the compositions yield notable synergistic performance benefits on top of other established sparse optimizations such as sparse matrix tiling.
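To make the idea of algebraic reassociation concrete, the following is a minimal NumPy sketch, not SENSEi's implementation. It assumes the standard graph convolutional layer $D^{-1/2} A D^{-1/2} H W$ (the abstract does not spell out which compositions SENSEi considers), and all function names, the toy ring graph, and the edge-list `spmm` helper are hypothetical choices made for illustration. The three functions compute the same result using different mixes of sparse and dense primitives.

```python
import numpy as np

def spmm(src, dst, vals, H, n):
    """Sparse-dense product over an edge list: out[dst] += vals * H[src]."""
    out = np.zeros((n, H.shape[1]))
    np.add.at(out, dst, vals[:, None] * H[src])
    return out

def gcn_precompute(src, dst, d, H, W, n):
    # Fold the normalization into per-edge values once (a sparse step),
    # then run a weighted aggregation followed by a dense matrix multiply.
    vals = d[dst] * d[src]
    return spmm(src, dst, vals, H, n) @ W

def gcn_dynamic(src, dst, d, H, W, n):
    # Keep the adjacency unweighted and apply the normalization as dense
    # row scalings before and after the aggregation.
    ones = np.ones(len(src))
    return (d[:, None] * spmm(src, dst, ones, d[:, None] * H, n)) @ W

def gcn_update_first(src, dst, d, H, W, n):
    # Reassociate so the dense multiply runs first; the aggregation then
    # operates on the output embedding width instead of the input width.
    vals = d[dst] * d[src]
    return spmm(src, dst, vals, H @ W, n)

# Toy graph (assumption): 6 nodes, self-loops plus a directed ring.
n, k_in, k_out = 6, 4, 3
nodes = np.arange(n)
src = np.concatenate([nodes, nodes])
dst = np.concatenate([nodes, (nodes + 1) % n])
deg = np.bincount(dst, minlength=n).astype(float)
d = 1.0 / np.sqrt(deg)  # diagonal of D^{-1/2}

rng = np.random.default_rng(0)
H = rng.standard_normal((n, k_in))
W = rng.standard_normal((k_in, k_out))

y1 = gcn_precompute(src, dst, d, H, W, n)
y2 = gcn_dynamic(src, dst, d, H, W, n)
y3 = gcn_update_first(src, dst, d, H, W, n)
print(np.allclose(y1, y2), np.allclose(y1, y3))
```

Although the outputs agree, the work differs: for example, `gcn_update_first` aggregates vectors of width `k_out` rather than `k_in`, so it does less sparse work when the output embedding is narrower than the input. Which ordering is fastest in practice depends on the graph, the embedding sizes, and the hardware, which is the dependence the abstract describes.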