Pipelining between data loading and computation is a critical tensor program optimization for GPUs. To unleash the full performance of the latest GPUs, multi-stage pipelining must be optimized jointly across the GPU's multi-level buffer hierarchy. Existing frameworks rely on hand-written libraries such as cuBLAS for pipelining, an approach that neither extends to new operators nor composes with existing tensor compiler optimizations. This paper presents ALCOP, the first framework that is compiler-native and fully supports multi-stage, multi-level pipelining. ALCOP overcomes three critical obstacles in generating pipelined code: detecting the buffers to which pipelining applies, transforming programs for multi-level multi-stage pipelining, and searching schedule parameters efficiently by incorporating static analysis. Experiments show that ALCOP generates programs with an average speedup of 1.23x (up to 1.73x) over vanilla TVM. On end-to-end models, ALCOP improves upon TVM by up to 1.18x and upon XLA by up to 1.64x. In addition, our performance model significantly improves the efficiency of schedule tuning: it finds schedules that reach 99% of the performance found by exhaustive search while requiring 40x fewer trials.
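As a rough illustration of the multi-stage pipelining described above, the following Python sketch lists the order in which a single-level software pipeline would issue tile loads and computations over a circular buffer. It is not ALCOP's transformation or its generated code. The function name `pipelined_schedule`, the parameters `num_tiles` and `num_stages`, and the event tuples are hypothetical names chosen for this example, and the sketch models only one buffer level rather than the multi-level hierarchy the paper targets.

```python
def pipelined_schedule(num_tiles, num_stages):
    """Return ordered (op, tile, slot) events for a single-level software pipeline.

    Assumption for this sketch: a circular buffer with `num_stages` slots,
    where tile t always lives in slot t % num_stages.
    """
    if num_tiles < 0 or num_stages < 1:
        raise ValueError("num_tiles must be >= 0 and num_stages >= 1")

    events = []

    # Prologue: prefetch the first num_stages - 1 tiles before any compute.
    for t in range(min(num_stages - 1, num_tiles)):
        events.append(("load", t, t % num_stages))

    # Steady state: issue the load for a future tile, then compute the current one.
    # The future tile reuses the slot freed by the previous iteration's compute.
    for i in range(num_tiles):
        nxt = i + num_stages - 1
        if nxt < num_tiles:
            events.append(("load", nxt, nxt % num_stages))
        events.append(("compute", i, i % num_stages))

    return events


if __name__ == "__main__":
    for event in pipelined_schedule(num_tiles=4, num_stages=2):
        print(event)
```

On a real GPU the loads would be asynchronous copies, so the load issued ahead of each compute step could overlap with that computation; this sequential listing shows only the issue order and the buffer slot reuse. With `num_stages=1` the schedule degenerates to load-then-compute for every tile, which corresponds to the unpipelined case.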