As deep learning models scale, their training costs have surged. Hardware advancements and limitations in current software stacks have together heightened the need for data efficiency, that is, the effective hiding of data access latency and the avoidance of unnecessary data movement. Major challenges arise from the growing disparity between GPU memory bandwidth and computational throughput, imminent GPU memory capacity limits, and inefficiencies in the PyTorch software stack, including the lack of device-specific PCIe transfer optimizations and of high-level domain-specific abstractions. To mitigate these data inefficiencies, this dissertation first analyzes them in representative deep learning training tasks, specifically graph neural networks (GNNs) and large language models (LLMs). It then proposes novel runtime and code generation techniques that address these challenges, implementing the optimizations seamlessly within the PyTorch stack while preserving strong programmability and interoperability. First, PyTorch-Direct introduces a GPU-centric PCIe data transfer paradigm into PyTorch for GNN training. Next, the Hector intermediate representation (IR) and its code generator provide domain-specific high-level abstractions and systematically address memory-intensive performance challenges in relational GNNs. Finally, because LLM training throughput is increasingly constrained by GPU memory capacity, the SSDTrain offloading framework is designed and implemented to alleviate this bottleneck. Together, these contributions show that code generation and runtime techniques can systematically mitigate the data management bottlenecks in deep learning training, bottlenecks that stem from the data-intensive nature of the workloads and from oversimplifications inherent in the deep learning training software stack.