Automated code generation and performance optimization for sparse tensor algebra are essential in many real-world applications such as quantum computing, physics, chemistry, and machine learning. General sparse tensor algebra compilers are not always versatile enough to generate asymptotically optimal code for sparse tensor contractions. This paper shows how to optimize and generate asymptotically better schedules for complex tensor expressions using kernel fission and fusion. We present a generalized loop transformation that nests loops so as to minimize memory footprint and reduce asymptotic complexity. Furthermore, we present an auto-scheduler whose cost model, based on a partially ordered set, considers both time and auxiliary memory complexity in its pruning stages. In addition, we highlight the use of SMT solvers in sparse auto-schedulers to prune the Pareto frontier to the smallest possible set of schedules, using user-defined constraints available at compile time. Finally, we show that our auto-scheduler can select asymptotically better schedules that use our compiler transformation to generate optimized code. Our results show that the auto-scheduler achieves orders-of-magnitude speedups over TACO-generated code for several real-world tensor algebra computations on different real-world inputs.
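To make the idea of a partially ordered cost model concrete, the following is a minimal sketch, not the paper's implementation. It assumes a dense contraction A(i,l) = Σ_jk B(i,j)·C(j,k)·D(k,l) purely for illustration, and models each complexity as a sum of monomials over the hypothetical dimension sizes I, J, K, L. The names `Schedule`, `leq`, `dominates`, and `pareto_frontier`, as well as the four candidate schedules, are assumptions introduced here. A schedule is pruned only when another schedule is no worse in both time and auxiliary memory and strictly better in at least one.

```python
from typing import Dict, List, NamedTuple

# A monomial maps a dimension name to its exponent: {"I": 1, "J": 1} is I*J,
# and {} is the constant 1. A complexity is a sum of monomials.
Monomial = Dict[str, int]
Complexity = List[Monomial]


class Schedule(NamedTuple):
    name: str
    time: Complexity    # asymptotic operation count
    memory: Complexity  # asymptotic auxiliary (temporary) storage


def mono_leq(a: Monomial, b: Monomial) -> bool:
    """a <= b for all dimension sizes >= 1 iff no exponent of a exceeds b's."""
    return all(a.get(v, 0) <= b.get(v, 0) for v in set(a) | set(b))


def leq(a: Complexity, b: Complexity) -> bool:
    """O(a) is within O(b) if every term of a is bounded by some term of b."""
    return all(any(mono_leq(m, n) for n in b) for m in a)


def dominates(s: Schedule, t: Schedule) -> bool:
    """s dominates t: no worse in time and memory, strictly better in one."""
    no_worse = leq(s.time, t.time) and leq(s.memory, t.memory)
    equivalent = leq(t.time, s.time) and leq(t.memory, s.memory)
    return no_worse and not equivalent


def pareto_frontier(schedules: List[Schedule]) -> List[Schedule]:
    """Keep every schedule that no other schedule dominates."""
    return [s for s in schedules
            if not any(dominates(t, s) for t in schedules)]


def m(dims: str) -> Monomial:
    """Helper: m("IJK") builds the monomial I*J*K."""
    return {d: dims.count(d) for d in set(dims)}


# Hypothetical candidates for A(i,l) = sum_jk B(i,j) * C(j,k) * D(k,l).
candidates = [
    # One fused loop nest over i, j, k, l; no temporary.
    Schedule("fused", time=[m("IJKL")], memory=[m("")]),
    # Fission: T(i,k) = sum_j B*C, then A(i,l) = sum_k T*D.
    Schedule("fission_T_ik", time=[m("IJK"), m("IKL")], memory=[m("IK")]),
    # Same temporary, but computed inside the full four-deep nest.
    Schedule("wasteful", time=[m("IJKL")], memory=[m("IK")]),
    # Fission the other way: T(j,l) = sum_k C*D, then A(i,l) = sum_j B*T.
    Schedule("fission_T_jl", time=[m("JKL"), m("IJL")], memory=[m("JL")]),
]

frontier = pareto_frontier(candidates)
print([s.name for s in frontier])
# prints ['fused', 'fission_T_ik', 'fission_T_jl']
```

The three surviving schedules are mutually incomparable without more information: the fused schedule uses no temporary but more time, and the two fissioned schedules differ in which dimensions dominate. This is where compile-time constraints on the dimension sizes (which the abstract says are checked with an SMT solver) could shrink the frontier further; that step is not modeled in this sketch.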