Temporal Interaction Graphs (TIGs) are widely employed to model intricate real-world systems such as financial systems and social networks. To capture the dynamism and interdependencies of nodes, existing TIG embedding models need to process edges sequentially and in chronological order. However, this requirement prevents parallel processing, and the models struggle to fit burgeoning data volumes onto GPUs. Consequently, many large-scale temporal interaction graphs are confined to CPU processing, and a generalized GPU scaling and acceleration approach remains unavailable. To facilitate the training of large-scale TIGs on GPUs, we introduce a novel training approach, Streaming Edge Partitioning and Parallel Acceleration for Temporal Interaction Graph Embedding (SPEED). SPEED comprises a Streaming Edge Partitioning Component (SEP), which addresses the space overhead by assigning fewer nodes to each GPU, and a Parallel Acceleration Component (PAC), which addresses the time overhead by training different sub-graphs simultaneously. Our method achieves a good balance among computing resources, computing time, and downstream task performance. Empirical validation across 7 real-world datasets shows that SPEED can speed up training by a factor of up to 19.29x and reduce the resource consumption of a single GPU by up to 69%, enabling multi-GPU training and acceleration on graphs with millions of nodes and billions of edges. Our approach also remains competitive in downstream tasks.
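To make the idea of streaming edge partitioning concrete, the following is a minimal sketch of a generic greedy streaming partitioner. It is not the SEP algorithm itself, whose details are not given here. The function name `stream_partition`, the `(src, dst, timestamp)` edge format, and the tie-breaking rule are all assumptions made for illustration. Each edge is read once in chronological order and sent to the partition that already holds the most of its endpoints, with ties going to the partition holding fewer edges.

```python
def stream_partition(edges, num_parts):
    """Greedily assign each (src, dst, timestamp) edge to exactly one partition.

    Hypothetical illustration of streaming edge partitioning, not the
    paper's SEP component.
    """
    parts = [[] for _ in range(num_parts)]     # edges held by each partition
    nodes = [set() for _ in range(num_parts)]  # nodes present in each partition

    # Visit edges in chronological order, as TIG models require.
    for src, dst, ts in sorted(edges, key=lambda e: e[2]):
        # Prefer the partition that already contains more endpoints of this
        # edge; break ties by smaller edge count, then by lower index.
        best = min(
            range(num_parts),
            key=lambda p: (
                -((src in nodes[p]) + (dst in nodes[p])),
                len(parts[p]),
                p,
            ),
        )
        parts[best].append((src, dst, ts))
        nodes[best].update((src, dst))
    return parts, nodes


# Toy interaction stream: (source node, destination node, timestamp).
toy_edges = [
    (0, 1, 1.0), (2, 3, 2.0), (1, 0, 3.0),
    (3, 4, 4.0), (0, 5, 5.0), (2, 4, 6.0),
]
toy_parts, toy_nodes = stream_partition(toy_edges, num_parts=2)
for pid, (p_edges, p_nodes) in enumerate(zip(toy_parts, toy_nodes)):
    print(pid, sorted(p_nodes), p_edges)
```

Because edges are appended in stream order, each partition keeps its own edges in chronological order, so each sub-graph could in principle be handed to a separate GPU worker and trained at the same time as the others, which is the role the abstract assigns to PAC. A node whose edges land in several partitions would be replicated, and how such shared nodes are synchronized is outside the scope of this sketch.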