Sparse tensor decomposition and completion are common in numerous applications, ranging from machine learning to computational quantum chemistry. Typically, the main bottleneck in optimizing these models is the contraction of a single large sparse tensor with a network of several dense matrices or tensors (SpTTN). Prior work on high-performance tensor decomposition and completion has focused on performance and scalability optimizations for specific SpTTN kernels. We present algorithms and a runtime system for identifying and executing the most efficient loop nest for any SpTTN kernel. We consider both enumeration of such loop nests for autotuning and efficient algorithms for finding the lowest-cost loop nest under simpler metrics, such as buffer size or cache-miss models. Our runtime system identifies the best choice of loop nest without user guidance, and also provides a distributed-memory parallelization of SpTTN kernels. We evaluate our framework using both real-world and synthetic tensors. Our results demonstrate that our approach outperforms available generalized state-of-the-art libraries and matches the performance of specialized codes.
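To make the notion of an SpTTN kernel concrete, the following is a minimal sketch (not the paper's implementation) of one such contraction: MTTKRP, which multiplies a sparse 3-way tensor in COO format with two dense factor matrices inside a single fused loop nest over nonzeros. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def mttkrp_coo(coords, vals, B, C, n_rows):
    """Illustrative SpTTN-style kernel: M[i, r] = sum over nonzeros (i, j, k)
    of T[i, j, k] * B[j, r] * C[k, r], with T stored in COO form."""
    rank = B.shape[1]
    M = np.zeros((n_rows, rank))
    for (i, j, k), v in zip(coords, vals):
        # One fused loop nest over the sparse tensor's nonzeros: each nonzero
        # contributes a rank-length multiply-add, and no dense intermediate
        # tensor is ever materialized.
        M[i, :] += v * (B[j, :] * C[k, :])
    return M
```

Choosing the order in which the sparse and dense loops are nested, and which partial results are buffered, is exactly the search space the abstract refers to; different loop nests for the same kernel trade off buffer size against cache behavior.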