Dynamic sparsity, where sparsity patterns are unknown until runtime, poses a significant challenge for deep learning. State-of-the-art sparsity-aware deep learning solutions are restricted to pre-defined, static sparsity patterns because of the significant overhead of preprocessing. Efficient execution of dynamic sparse computation also faces a misalignment between the GPU-friendly tile configuration needed for efficient execution and the sparsity-aware tile shape that minimizes coverage waste (zero values that a tile covers along with the non-zero values in a tensor). In this paper, we propose PIT, a deep-learning compiler for dynamic sparsity. PIT introduces a novel tiling mechanism that leverages Permutation Invariant Transformation (PIT), a mathematically proven property, to transform multiple sparsely located micro-tiles into a GPU-efficient dense tile without changing the computation results, thus achieving both high GPU utilization and low coverage waste. Given a model, PIT first finds feasible PIT rules for all its operators and generates efficient GPU kernels accordingly. At runtime, with the novel SRead and SWrite primitives, PIT rules can be executed extremely fast to support dynamic sparsity in an online manner. Extensive evaluation on diverse models shows that PIT accelerates dynamic sparsity computation by up to 5.9x (2.43x on average) over state-of-the-art compilers.
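The idea of gathering sparsely located micro-tiles into a dense tile, computing on it, and scattering the results back can be illustrated with a minimal pure-Python sketch. This is not PIT's implementation or API: the functions `sread`, `swrite`, and `sparse_matmul` are hypothetical names chosen for illustration, the micro-tile is assumed to be a whole row, and the dense kernel is a plain list-based matrix multiplication rather than a GPU kernel. It only shows why permuting along the row axis leaves the result unchanged.

```python
def matmul(a, b):
    """Dense reference matmul on lists of lists (stand-in for a GPU kernel)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def sread(a):
    """Hypothetical gather: pack rows of `a` that hold a non-zero into a dense tile."""
    idx = [i for i, row in enumerate(a) if any(v != 0 for v in row)]
    return idx, [a[i] for i in idx]


def swrite(idx, tile_out, num_rows, num_cols):
    """Hypothetical scatter: place dense result rows back at their original indices."""
    out = [[0] * num_cols for _ in range(num_rows)]
    for i, row in zip(idx, tile_out):
        out[i] = list(row)
    return out


def sparse_matmul(a, b):
    """Compute a @ b while skipping all-zero rows of `a`."""
    idx, tile = sread(a)
    if not tile:  # every row is zero, so the product is all zeros
        return [[0] * len(b[0]) for _ in a]
    return swrite(idx, matmul(tile, b), len(a), len(b[0]))


# Rows 0 and 2 are entirely zero; only rows 1 and 3 reach the dense kernel.
A = [[0, 0, 0], [1, 2, 0], [0, 0, 0], [0, 0, 3]]
B = [[1, 0], [0, 1], [2, 2]]
assert sparse_matmul(A, B) == matmul(A, B)
```

In this sketch the dense kernel processes two rows instead of four, and the final assertion checks that the gathered-and-scattered result matches the full dense product. Which rows are non-zero is discovered only when `sread` runs, which mirrors the runtime nature of dynamic sparsity described above.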