Pushing Tensor Accelerators Beyond MatMul in a User-Schedulable Language

Tensor accelerators now represent a growing share of compute resources in modern CPUs and GPUs. However, they are hard to program, so developers rely on vendor-provided kernel libraries to use them. As a result, the use of tensor accelerators is limited to the interfaces these libraries provide, which are designed mainly for traditional ML and scientific computing workloads. In this paper, we show that tensor accelerators can improve the performance of applications beyond simple variants of MatMul. For example, many image processing pipelines are linear transformations over matrices in disguise and can therefore utilize such specialized hardware. Exploiting this opportunity is nonetheless hindered by the difficulty of programming tensor accelerators. We tackle this problem with compiler-based techniques. We build on the Halide user-schedulable language, in which operations are expressed succinctly as Halide algorithms, and implement a flexible tensor instruction selector based on equality saturation. The tensor instruction selector supports both CPU- and GPU-attached tensor accelerators and works with existing scheduling operations (e.g., producer-consumer fusion). Together, these enable developers to write diverse accelerator-leveraging applications in a few dozen lines. Using our system, we demonstrate the potential of tensor accelerators beyond their traditional domains. We implement several image processing pipelines (e.g., filtering, resampling, and denoising) in our system and evaluate them against non-accelerator-leveraging baselines. We show that these pipelines can achieve significant speedups. For example, a downsampling routine is sped up by $6.1\times$ by utilizing Tensor Cores on an Nvidia RTX 4070 GPU.
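To make the "linear transformations in disguise" observation concrete, the following minimal sketch (plain Python, not Halide, and not the paper's implementation) writes a 2x box-filter downsampling step two ways: with explicit loops, and as two matrix products of the form $D_r \, I \, D_c^{\top}$. The box filter, the 4x4 test image, and all function names are assumptions chosen for illustration. The second form is the kind of computation a MatMul-oriented accelerator can execute.

```python
# Illustrative sketch only: every name below is hypothetical and
# does not come from the paper's system or from Halide.

def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def downsample_matrix(n):
    """Build the (n // 2) x n matrix whose row i averages inputs 2i and 2i + 1."""
    d = [[0.0] * n for _ in range(n // 2)]
    for i in range(n // 2):
        d[i][2 * i] = 0.5
        d[i][2 * i + 1] = 0.5
    return d

def downsample_direct(img):
    """Reference version: average each 2x2 block with explicit loops."""
    h, w = len(img), len(img[0])
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w // 2)]
            for y in range(h // 2)]

def downsample_matmul(img):
    """Same filter as two matrix products: D_rows @ img @ D_cols^T."""
    h, w = len(img), len(img[0])
    rows_reduced = matmul(downsample_matrix(h), img)
    return matmul(rows_reduced, transpose(downsample_matrix(w)))

# Small 4x4 test image with pixel value 4*y + x.
image = [[float(4 * y + x) for x in range(4)] for y in range(4)]
direct = downsample_direct(image)
via_matmul = downsample_matmul(image)
```

Both versions produce the same 2x2 result on this input. The matrix form performs redundant multiplications by zero, which is acceptable only when the hardware executing the products is fast enough to offset that extra work.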
