Sparse tensor decomposition and completion are common in numerous applications, ranging from machine learning to computational quantum chemistry. Typically, the main bottleneck in optimizing these models is the contraction of a single large sparse tensor with a network of several dense matrices or tensors (SpTTN). Prior work on high-performance tensor decomposition and completion has focused on performance and scalability optimizations for specific SpTTN kernels. We present algorithms and a runtime system for identifying and executing the most efficient loop nest for any SpTTN kernel. We consider both enumeration of such loop nests for autotuning and efficient algorithms for finding the lowest-cost loop nest under simpler metrics, such as buffer size or cache-miss models. Our runtime system identifies the best choice of loop nest without user guidance and also provides a distributed-memory parallelization of SpTTN kernels. We evaluate our framework using both real-world and synthetic tensors. Our results demonstrate that our approach outperforms available general-purpose state-of-the-art libraries and matches the performance of specialized codes.
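To make the notion of competing loop nests concrete, the sketch below uses MTTKRP, the kernel that dominates CP decomposition, as an assumed example of an SpTTN kernel; the abstract itself does not name it. It contracts a sparse third-order tensor with two dense factor matrices in two ways: a flat loop over nonzeros, and a factorized loop nest that accumulates into a small buffer of length `rank` per (i, j) fiber. The function names, the dictionary-based sparse format, and the toy data are all illustrative choices, not the interface of the framework described above.

```python
from collections import defaultdict


def mttkrp_unfactorized(nnz, B, C, n_i, rank):
    """A[i][r] = sum_{j,k} T[i,j,k] * B[j][r] * C[k][r] in one flat loop over nonzeros.

    Performs two multiplications per (nonzero, r) pair and needs no buffer.
    """
    A = [[0.0] * rank for _ in range(n_i)]
    for (i, j, k), val in nnz.items():
        for r in range(rank):
            A[i][r] += val * B[j][r] * C[k][r]
    return A


def mttkrp_factorized(nnz, B, C, n_i, rank):
    """Same contraction, with the product by B hoisted out of the k loop.

    Each (i, j) fiber first reduces over k into a buffer of length `rank`,
    then multiplies by B once per fiber instead of once per nonzero.
    """
    # Group nonzeros by (i, j) fiber; a CSF-like layout would provide this directly.
    fibers = defaultdict(list)
    for (i, j, k), val in nnz.items():
        fibers[(i, j)].append((k, val))

    A = [[0.0] * rank for _ in range(n_i)]
    for (i, j), entries in fibers.items():
        buf = [0.0] * rank  # intermediate whose size a buffer-size metric would count
        for k, val in entries:
            for r in range(rank):
                buf[r] += val * C[k][r]
        for r in range(rank):
            A[i][r] += buf[r] * B[j][r]
    return A


# Toy sparse tensor of shape 2 x 3 x 2, stored as {(i, j, k): value}.
T = {(0, 0, 0): 1.0, (0, 0, 1): 2.0, (1, 2, 1): 3.0, (1, 0, 0): 4.0}
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 x rank
C = [[1.0, 0.5], [2.0, 1.5]]              # 2 x rank

A_flat = mttkrp_unfactorized(T, B, C, n_i=2, rank=2)
A_fact = mttkrp_factorized(T, B, C, n_i=2, rank=2)
print(A_flat)
print(A_fact)
```

Both loop nests compute the same output, but they trade arithmetic against intermediate storage: the factorized version saves multiplications when fibers hold many nonzeros, at the cost of a length-`rank` buffer. Choosing among such variants for an arbitrary kernel is the kind of decision the abstract attributes to the runtime system.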