We employ pressure point analysis and roofline modeling to identify performance bottlenecks and to determine an upper bound on the performance of the Canonical Polyadic Alternating Poisson Regression Multiplicative Update (CP-APR MU) algorithm in the SparTen software library. Our analyses reveal that a particular matrix computation, $\Phi^{(n)}$, is the critical performance bottleneck in the SparTen CP-APR MU implementation. Moreover, we find that atomic operations are not a critical bottleneck, whereas higher cache reuse can provide a non-trivial performance improvement. We also use a grid search over the parallel policy parameters of the Kokkos library to achieve an average speedup of 2.25x over the SparTen default for the $\Phi^{(n)}$ computation on CPU and 1.70x on GPU. We conclude our investigations by comparing Kokkos implementations of the STREAM benchmark and of the matricized tensor times Khatri-Rao product (MTTKRP) benchmark from the Parallel Sparse Tensor Algorithm (PASTA) benchmark suite to implementations using vendor libraries. We show that, with a single implementation, Kokkos achieves performance comparable to hand-tuned code for the fundamental operations that make up tensor decomposition kernels on a wide range of CPU and GPU systems. Overall, we conclude that Kokkos demonstrates good performance portability for simple data-intensive operations but requires tuning for algorithms with more complex dependencies and data access patterns.
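
For readers unfamiliar with roofline modeling, the performance upper bound mentioned above can be illustrated with the basic textbook form of the model, in which attainable throughput is limited either by peak compute or by memory bandwidth times arithmetic intensity. The following is a minimal sketch only: the function names and the machine numbers are hypothetical and are not taken from SparTen or from the measurements reported here.

```python
def roofline_bound(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Basic roofline bound on attainable GFLOP/s.

    peak_gflops: peak compute throughput of the machine (GFLOP/s)
    bandwidth_gbs: sustained memory bandwidth (GB/s), e.g. as measured by STREAM
    arithmetic_intensity: FLOPs performed per byte moved to or from memory
    """
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) at which a kernel stops being memory bound."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine (illustrative numbers, not from the paper).
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

# A low-intensity kernel is capped by the bandwidth term.
memory_bound = roofline_bound(PEAK_GFLOPS, BANDWIDTH_GBS, 0.25)
# A high-intensity kernel is capped by peak compute.
compute_bound = roofline_bound(PEAK_GFLOPS, BANDWIDTH_GBS, 50.0)
```

On this hypothetical machine, any kernel with an arithmetic intensity below the ridge point is bounded by memory bandwidth rather than peak compute, which is the regime in which improved cache reuse would be expected to help.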