Sparse tensor algebra is a challenging class of workloads to accelerate due to low arithmetic intensity and varying sparsity patterns. Prior sparse tensor algebra accelerators have explored tiling sparse data to increase exploitable data reuse and improve throughput, but they typically size the tiles held in a given buffer for the worst-case data occupancy. This severely limits the utilization of available memory resources and reduces data reuse. Other accelerators employ complex tiling during preprocessing or at runtime to determine the exact size of each tile based on its occupancy. This paper proposes a speculative tensor tiling approach, called overbooking, that improves buffer utilization by exploiting the distribution of nonzero elements in sparse tensors to construct larger tiles with greater data reuse. To ensure correctness, we propose a low-overhead hardware mechanism, Tailors, that tolerates data overflow by design while preserving reasonable data reuse. We demonstrate that Tailors can be easily integrated into the memory hierarchy of an existing sparse tensor algebra accelerator. To ensure high buffer utilization with minimal tiling overhead, we introduce a statistical approach, Swiftiles, that picks a tile size so that tiles usually fit within the buffer's capacity but can occasionally overflow, i.e., it overbooks the buffers. Across a suite of 22 sparse tensor algebra workloads, we show that our proposed overbooking strategy achieves an average speedup of $52.7\times$ and $2.3\times$ and an average energy reduction of $22.5\times$ and $2.5\times$ over ExTensor without and with optimized tiling, respectively.
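The contrast between worst-case tiling and overbooking can be illustrated with a small sketch. It is not the Swiftiles algorithm itself: the function names, the row-wise tiling, the exhaustive scan over tile sizes, and the toy nonzero counts are all assumptions made for illustration. The sketch only shows how allowing a bounded fraction of tiles to overflow the buffer permits a larger tile size than insisting that no tile ever overflows.

```python
def tile_occupancies(nnz_per_row, rows_per_tile):
    """Count the nonzeros in each tile of consecutive rows."""
    return [sum(nnz_per_row[i:i + rows_per_tile])
            for i in range(0, len(nnz_per_row), rows_per_tile)]


def overflow_fraction(nnz_per_row, rows_per_tile, capacity):
    """Fraction of tiles whose occupancy exceeds the buffer capacity."""
    occupancies = tile_occupancies(nnz_per_row, rows_per_tile)
    return sum(o > capacity for o in occupancies) / len(occupancies)


def pick_rows_per_tile(nnz_per_row, capacity, max_overflow):
    """Largest tile size whose overflow fraction stays within max_overflow.

    max_overflow = 0.0 corresponds to worst-case sizing (no tile may
    overflow); a positive value overbooks the buffer. Falls back to one
    row per tile if no size satisfies the bound.
    """
    best = 1
    for rows in range(1, len(nnz_per_row) + 1):
        if overflow_fraction(nnz_per_row, rows, capacity) <= max_overflow:
            best = rows
    return best


# Hypothetical matrix: 12 rows, mostly one nonzero each, one dense row.
nnz_per_row = [1] * 8 + [6] + [1] * 3
capacity = 8  # buffer holds 8 nonzeros (assumed)

worst_case = pick_rows_per_tile(nnz_per_row, capacity, max_overflow=0.0)
overbooked = pick_rows_per_tile(nnz_per_row, capacity, max_overflow=0.5)
print(worst_case, overbooked)
```

In this toy input the single dense row forces the worst-case tile size down, while tolerating overflow in some tiles allows a larger tile. A practical scheme would estimate the occupancy distribution rather than scan every candidate size, and it relies on the hardware handling the overflowing tiles correctly.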