Deep neural networks (DNNs) are critical to applications in many domains. To accelerate DNN computation, tensor compilers have been proposed to generate efficient code for different domain-specific accelerators. Existing tensor compilers mainly focus on optimizing computation efficiency. However, memory access is becoming a key performance bottleneck because the computational performance of accelerators is increasing much faster than their memory performance. Because the intermediate representations (IRs) of current tensor compilers lack a direct description of memory access and data dependence, generating memory-efficient code is a significant challenge. In this paper, we propose IntelliGen, a tensor compiler that generates high-performance code for memory-intensive operators by considering both computation and data movement optimizations. IntelliGen represents a DNN program using GIR, which includes primitives that indicate its computation, data movement, and parallel strategies. This information is then composed into an instruction-level dataflow graph, on which IntelliGen performs holistic optimizations by searching over different memory access patterns and computation operations and generates memory-efficient code for different hardware. We evaluate IntelliGen on NVIDIA GPU, AMD GPU, and Cambricon MLU, showing speedups of up to 1.97x, 2.93x, and 16.91x (1.28x, 1.23x, and 2.31x on average), respectively, compared to the most performant existing frameworks.
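The claim that memory access limits memory-intensive operators can be illustrated with a rough roofline-style estimate. The sketch below is not part of IntelliGen. The hardware figures (100 TFLOP/s peak compute, 1 TB/s memory bandwidth) and the helper names are hypothetical, chosen only for illustration. The matrix multiplication cost assumes ideal reuse, with each matrix moved to or from memory exactly once.

```python
# Rough roofline-style check: an operator is memory-bound when its arithmetic
# intensity (FLOPs per byte moved) is below the machine balance
# (peak FLOP/s divided by peak bytes/s).

# Hypothetical accelerator figures, for illustration only.
PEAK_FLOPS = 100e12      # 100 TFLOP/s
PEAK_BANDWIDTH = 1e12    # 1 TB/s


def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def is_memory_bound(flops, bytes_moved, peak_flops, peak_bandwidth):
    """True if the operator cannot reach peak compute at this intensity."""
    machine_balance = peak_flops / peak_bandwidth
    return arithmetic_intensity(flops, bytes_moved) < machine_balance


def elementwise_add_cost(n, dtype_bytes=4):
    """c = a + b over n elements: one FLOP each, two reads and one write."""
    return n, 3 * n * dtype_bytes


def matmul_cost(m, n, k, dtype_bytes=4):
    """C[m,n] = A[m,k] @ B[k,n], assuming each matrix is moved exactly once."""
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * dtype_bytes
    return flops, bytes_moved


if __name__ == "__main__":
    for name, cost in [
        ("elementwise add", elementwise_add_cost(1 << 20)),
        ("4096^3 matmul", matmul_cost(4096, 4096, 4096)),
    ]:
        bound = is_memory_bound(*cost, PEAK_FLOPS, PEAK_BANDWIDTH)
        print(name, arithmetic_intensity(*cost), "memory-bound" if bound else "compute-bound")
```

Under these assumed figures, the element-wise addition performs far fewer FLOPs per byte than the machine balance, so its runtime is set by data movement rather than arithmetic, whereas the large matrix multiplication is compute-bound. Operators in the first category are the ones where optimizing data movement matters most.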