Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs

As deep learning models are now widely adopted by both cloud services and edge devices, reducing the latency of model inference has become crucial for efficient model serving. However, it is challenging to develop efficient tensor programs for deep learning operators due to the high complexity of modern accelerators and the rapidly growing number of operators. Deep learning compilers, such as Apache TVM, adopt declarative scheduling primitives to lower the bar for developing tensor programs. However, we show that this approach is insufficient to cover state-of-the-art tensor program optimizations. In this paper, we propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering. This new approach greatly enriches the expressible optimizations by allowing developers to manipulate tensor programs at a much finer granularity. We call the proposed method the task-mapping programming paradigm. In addition, we propose a new post-scheduling fusion optimization that lets developers focus on scheduling each operator individually and automates the fusion after scheduling, which greatly reduces the engineering effort for operator fusion. Our proposed paradigm also constructs an efficient hardware-centric schedule space, which is agnostic to the program input size and greatly reduces the tuning time. With the proposed paradigm, we implement a deep learning compiler, Hidet. Extensive experiments on modern convolution and transformer models show that Hidet outperforms the state-of-the-art DNN inference framework ONNX Runtime and the compiler TVM, equipped with the AutoTVM and Ansor schedulers, by up to 1.48x (1.22x on average). It also reduces the tuning time by 20x and 11x compared with AutoTVM and Ansor, respectively. Hidet is open-sourced at https://www.github.com/hidet-org/hidet.
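The abstract does not spell out what a task mapping looks like. One way to read "computation assignment and ordering" is as a function from a worker index to an ordered list of task indices. The following is a minimal sketch under that assumption, in plain Python; the helpers `spatial`, `repeat`, and `compose` are hypothetical names defined here for illustration and are not taken from the Hidet code base or its API.

```python
from itertools import product

# A mapping is modeled as (num_workers, task_shape, fn), where fn(worker)
# returns the ordered list of task indices that worker executes.
# This representation is an assumption made for this sketch.

def _grid(shape):
    """All indices of a task grid in row-major order."""
    return list(product(*(range(n) for n in shape)))

def spatial(*shape):
    """One task per worker: worker w gets the w-th grid index."""
    tasks = _grid(shape)
    return len(tasks), shape, lambda w: [tasks[w]]

def repeat(*shape):
    """A single worker executes every grid index, in row-major order."""
    tasks = _grid(shape)
    return 1, shape, lambda w: list(tasks)

def compose(outer, inner):
    """The outer mapping selects a tile; the inner mapping selects tasks in it."""
    n_out, s_out, f_out = outer
    n_in, s_in, f_in = inner
    shape = tuple(a * b for a, b in zip(s_out, s_in))

    def fn(w):
        # Split the flat worker index into (outer worker, inner worker).
        w_out, w_in = divmod(w, n_in)
        return [
            tuple(o * s + i for o, s, i in zip(t_out, s_in, t_in))
            for t_out in f_out(w_out)
            for t_in in f_in(w_in)
        ]

    return n_out * n_in, shape, fn

# 4 workers over a 4x4 task grid: each worker owns one 2x2 tile
# and walks through the 4 tasks of that tile sequentially.
num_workers, task_shape, tasks_of = compose(spatial(2, 2), repeat(2, 2))
```

Here worker 1 handles tasks (0, 2), (0, 3), (1, 2), (1, 3), in that order. Swapping the arguments, `compose(repeat(2, 2), spatial(2, 2))`, keeps the same 4 workers and 16 tasks but interleaves them, so the choice of composition determines both which worker runs a task and when.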
