Sparse tensor algebra is a challenging class of workloads to accelerate due to low arithmetic intensity and varying sparsity patterns. Prior sparse tensor algebra accelerators have explored tiling sparse data to increase exploitable data reuse and improve throughput, but they typically size the tiles held in a given buffer for worst-case data occupancy. This severely limits the utilization of available memory resources and reduces data reuse. Other accelerators employ complex tiling during preprocessing or at runtime to determine the exact size of each tile based on its occupancy. This paper proposes a speculative tensor tiling approach, called overbooking, that improves buffer utilization by taking advantage of the distribution of nonzero elements in sparse tensors to construct larger tiles with greater data reuse. To ensure correctness, we propose a low-overhead hardware mechanism, Tailors, that tolerates data overflow by design while preserving reasonable data reuse. We demonstrate that Tailors can be easily integrated into the memory hierarchy of an existing sparse tensor algebra accelerator. To ensure high buffer utilization with minimal tiling overhead, we introduce a statistical approach, Swiftiles, that picks a tile size so that tiles usually fit within the buffer's capacity but can potentially overflow, i.e., it overbooks the buffers. Across a suite of 22 sparse tensor algebra workloads, we show that our proposed overbooking strategy achieves average speedups of $52.7\times$ and $2.3\times$ and average energy reductions of $22.5\times$ and $2.5\times$ over ExTensor without and with optimized tiling, respectively.
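The abstract does not spell out how a tile size is chosen, so the following is only a minimal sketch of the overbooking idea, not the Swiftiles algorithm itself. It assumes a matrix tiled by groups of consecutive rows, a buffer capacity measured in nonzeros, and a tolerated fraction of overflowing tiles. The function names, the row-based tiling, and the 10% overflow target are hypothetical choices made for illustration.

```python
def tile_occupancies(row_nnz, rows_per_tile):
    """Nonzeros per tile when consecutive rows are grouped into tiles."""
    return [sum(row_nnz[i:i + rows_per_tile])
            for i in range(0, len(row_nnz), rows_per_tile)]


def pick_tile_size(row_nnz, buffer_capacity, overflow_target, candidates):
    """Return the largest candidate tile size (in rows) whose fraction of
    overflowing tiles does not exceed overflow_target, or None if none fits.

    overflow_target = 0.0 models worst-case sizing (no tile may overflow);
    overflow_target > 0.0 models overbooking (some tiles may overflow).
    """
    best = None
    for rows_per_tile in sorted(candidates):
        occupancies = tile_occupancies(row_nnz, rows_per_tile)
        overflowing = sum(occ > buffer_capacity for occ in occupancies)
        if overflowing / len(occupancies) <= overflow_target:
            best = rows_per_tile
    return best


# Illustrative input (assumed, not from the paper): 36 sparse rows with
# 2 nonzeros each, followed by 4 dense rows with 8 nonzeros each.
row_nnz = [2] * 36 + [8] * 4
candidates = [1, 2, 4, 5, 8]

worst_case = pick_tile_size(row_nnz, buffer_capacity=8,
                            overflow_target=0.0, candidates=candidates)
overbooked = pick_tile_size(row_nnz, buffer_capacity=8,
                            overflow_target=0.1, candidates=candidates)
print(worst_case, overbooked)  # prints "1 4"
```

In this toy input, the few dense rows force worst-case sizing down to one row per tile, while allowing 10% of tiles to overflow permits four rows per tile. Overbooking exploits this gap, and a mechanism such as Tailors is then needed to handle the tiles that do overflow.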