DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

Size Zheng, Xuegui Zheng, Hanshi Sun, Qi Hou, Wenlei Bao, Shiyu Li, Haojie Duanmu, Jin Fang, Chenli Xue, Chenhui Huang, Yuanqiang Liu, Renze Chen, Ningxin Zheng, Dongyang Wang, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, Xin Liu

The scaling of large language models (LLMs) is currently bottlenecked by the rigidity of distributed programming. While high-performance libraries like cuBLAS and NCCL provide optimized primitives, they lack the flexibility required for rapidly evolving model architectures. Conversely, existing tensor compilers fail to address the complex memory hierarchy of distributed clusters effectively. To bridge this gap, we propose DITRON, a scalable tile-level compiler that democratizes high-performance distributed kernel development. DITRON introduces a novel hierarchical programming abstraction spanning Core, Device, and Task levels to map tensor programs efficiently onto heterogeneous distributed hardware. This abstraction allows DITRON to support diverse parallelism strategies while hiding the complexity of inter-node and intra-node communication. Evaluated across large-scale clusters, DITRON matches or exceeds the performance of expert-tuned CUDA libraries, delivering speedups of 6%-30% on isolated kernels and 5%-30% on end-to-end inference in vLLM. Furthermore, DITRON demonstrates strong portability, achieving significant speedups on both NVIDIA and AMD platforms. DITRON has been deployed at the enterprise level for both training and inference. It achieves a Model FLOPs Utilization (MFU) improvement of over 10% in training tasks, saving approximately 500,000 GPU hours of training cost per month. For inference tasks, it delivers an end-to-end gain of over 20% and has been applied to cloud service inference and edge inference scenarios.
