DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training

The rapid growth of deep learning models has increased the demand for efficient distributed training strategies. Fully sharded approaches such as ZeRO-3 and FSDP partition model parameters across GPUs and apply optimizations such as prefetching and unsharding to reduce communication overhead. However, these systems lack fine-grained control over memory and communication scheduling, which makes it difficult to balance computation--communication overlap against memory requirements. Coordinating multiple optimizations is also difficult, because their effects on memory usage interact. To tackle these challenges, we propose DeepCompile, a compiler-based optimization framework for distributed training. DeepCompile transforms user-defined models into computation graphs and applies a series of profiling-guided optimization passes, each of which modifies the graph based on profiling information such as execution time and memory usage. This design allows each pass to flexibly insert, reorder, or remove operations such as all-gather and memory allocation, improving communication--computation overlap and reducing memory pressure. Each pass can access updated profiling feedback from earlier passes, enabling coordinated optimizations. We further enhance DeepCompile with three additional optimizations: proactive prefetching, selective unsharding, and adaptive offloading. Our evaluation shows that DeepCompile achieves up to 1.28$\times$ and 1.54$\times$ speedups over ZeRO-3 and FSDP baselines, respectively, and up to a 7.01$\times$ throughput increase in settings with limited GPU resources using offloading.
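
The idea of a profiling-guided pass that reorders all-gather operations can be illustrated with a toy sketch. This is not DeepCompile's implementation or API: the `Op` record, the simulated profiler `profile_peak_memory`, the `prefetch_pass` heuristic, and the `run_passes` driver are all hypothetical names introduced here, and memory is counted in arbitrary units. The sketch assumes a schedule is a flat list of operations and that "profiling" is a simple simulation of memory deltas.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Op:
    """One node in a toy linear schedule (hypothetical, for illustration only)."""
    name: str
    kind: str       # "allgather", "compute", or "release"
    mem_delta: int  # memory allocated (+) or freed (-) by this op, in toy units


def profile_peak_memory(schedule: List[Op]) -> int:
    """Stand-in for a profiler: simulate the schedule and return peak memory."""
    current = peak = 0
    for op in schedule:
        current += op.mem_delta
        peak = max(peak, current)
    return peak


def prefetch_pass(schedule: List[Op], memory_limit: int) -> List[Op]:
    """Hoist each all-gather earlier, one slot at a time, while the simulated
    peak memory stays within memory_limit. All-gathers keep their relative order."""
    result = list(schedule)
    for i in range(len(result)):
        if result[i].kind != "allgather":
            continue
        j = i
        while j > 0 and result[j - 1].kind != "allgather":
            # Candidate schedule with the all-gather swapped one slot earlier.
            candidate = result[:j - 1] + [result[j], result[j - 1]] + result[j + 1:]
            if profile_peak_memory(candidate) > memory_limit:
                break  # moving further would exceed the memory budget
            result = candidate
            j -= 1
    return result


Pass = Callable[[List[Op], int], List[Op]]


def run_passes(schedule: List[Op], passes: List[Pass], memory_limit: int) -> List[Op]:
    """Apply passes in order, re-profiling after each one so that later passes
    see the memory effects of earlier passes."""
    for p in passes:
        schedule = p(schedule, memory_limit)
        print(f"{p.__name__}: peak memory = {profile_peak_memory(schedule)}")
    return schedule


# Two sharded layers: gather parameters, compute, then release them.
layers = [
    Op("ag0", "allgather", 4), Op("c0", "compute", 0), Op("rel0", "release", -4),
    Op("ag1", "allgather", 4), Op("c1", "compute", 0), Op("rel1", "release", -4),
]
optimized = run_passes(layers, [prefetch_pass], memory_limit=8)
print([op.name for op in optimized])
```

With a budget of 8 units, the second all-gather is hoisted to directly after the first, so it could overlap with the first layer's computation; with a budget of 4 units the schedule is left unchanged, because any earlier placement would hold both layers' parameters at once. The trade-off between overlap and memory that the abstract describes shows up here as the `memory_limit` check inside the pass.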
