Automated code generation and performance optimization for sparse tensor algebra are crucial, as sparse tensor computations have become essential in many real-world applications such as quantum computing, physics, chemistry, and machine learning. General-purpose sparse tensor algebra compilers are not always versatile enough to generate asymptotically optimal code for sparse tensor contractions. This paper shows how to generate asymptotically better schedules for complex tensor expressions using kernel fission and fusion. We present a generalized loop transformation that produces loop nests with a minimized memory footprint and reduced asymptotic complexity. Furthermore, we present an auto-scheduler whose cost model, based on partially ordered sets, considers both time and auxiliary memory complexity in its pruning stages. In addition, we highlight the use of SMT solvers in sparse auto-schedulers to prune the Pareto frontier of schedules to the smallest possible set, given user-defined constraints available at compile time. Finally, we show that our auto-scheduler can select asymptotically better schedules, for which our compiler transformation generates optimized code. Our results show that the auto-scheduler achieves orders-of-magnitude speedups over TACO-generated code for several real-world tensor algebra computations on different real-world inputs.
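To make the fission idea concrete, the following is a minimal sketch and not the transformation described in the paper. It assumes a hypothetical contraction y(i) = Σ_{j,k} B(i,j) · C(j,k) · v(k), with sparse B and C stored as nested dictionaries; the names `fused` and `fissioned` are illustrative. The fused loop nest rescans a row of C for every nonzero of B, whereas the fissioned version first materializes a temporary t(j) and therefore does work proportional to nnz(C) + nnz(B), at the cost of auxiliary memory for t.

```python
# Illustrative sketch only: sparse matrices are {row: {col: value}} dicts
# (an assumed format, not the one used by TACO or by the paper).

def fused(B, C, v):
    """One loop nest: for each nonzero B(i,j), rescan row j of C."""
    y = {}
    for i, row in B.items():
        acc = 0.0
        for j, b in row.items():
            for k, c in C.get(j, {}).items():
                acc += b * c * v[k]
        y[i] = acc
    return y

def fissioned(B, C, v):
    """Two kernels: t(j) = sum_k C(j,k) v(k), then y(i) = sum_j B(i,j) t(j)."""
    # Auxiliary memory: one entry of t per nonempty row of C.
    t = {j: sum(c * v[k] for k, c in row.items()) for j, row in C.items()}
    return {i: sum(b * t.get(j, 0.0) for j, b in row.items())
            for i, row in B.items()}

# Small hand-checkable input (values chosen so the arithmetic is exact).
B = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
C = {0: {1: 4.0}, 1: {0: 1.0, 2: 5.0}}
v = [1.0, 2.0, 3.0]

print(fused(B, C, v))      # {0: 8.0, 1: 48.0}
print(fissioned(B, C, v))  # {0: 8.0, 1: 48.0}
```

Both versions compute the same result; they differ only in the trade-off between time and auxiliary memory, which is the kind of trade-off a cost model for schedules has to weigh.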
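The pruning step can be illustrated with a small sketch under strong simplifying assumptions. Here each candidate schedule is summarized by two integer exponents, meaning O(n^time) time and O(n^memory) auxiliary memory; the `Schedule` type, the candidate list, and the memory budget are all hypothetical, and a plain filter stands in for the SMT solver. A schedule is dropped when another schedule is at least as good in both dimensions and differs in at least one; schedules that are better in one dimension and worse in the other are incomparable and both remain on the Pareto frontier.

```python
from typing import NamedTuple

class Schedule(NamedTuple):
    name: str
    time: int     # assumed: exponent of the time complexity
    memory: int   # assumed: exponent of the auxiliary memory complexity

def dominates(a: Schedule, b: Schedule) -> bool:
    """a dominates b if it is no worse in both costs and not identical in cost."""
    no_worse = a.time <= b.time and a.memory <= b.memory
    return no_worse and (a.time, a.memory) != (b.time, b.memory)

def pareto_frontier(schedules):
    """Keep every schedule that no other schedule dominates."""
    return [s for s in schedules
            if not any(dominates(o, s) for o in schedules)]

candidates = [
    Schedule("fused", time=3, memory=0),
    Schedule("fissioned", time=2, memory=1),
    Schedule("wasteful", time=3, memory=1),   # dominated by "fused"
    Schedule("big_temp", time=2, memory=2),   # dominated by "fissioned"
]

frontier = pareto_frontier(candidates)
print([s.name for s in frontier])  # ['fused', 'fissioned']

# A user-defined constraint known at compile time (hypothetical budget:
# no auxiliary memory allowed) shrinks the frontier further.
feasible = [s for s in frontier if s.memory <= 0]
print([s.name for s in feasible])  # ['fused']
```

Real complexities are symbolic expressions over tensor dimensions and sparsity parameters rather than single exponents, which is why comparing them can require a solver instead of integer comparison.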