This paper considers a parallel symmetric eigensolver for very small matrices in massively parallel processing. We define very small matrices as those that fit in the per-node caches of a supercomputer. We assume that such sizes also match the exa-scale computing requirements of current production runs of an application. To minimize communication time, we add several communication-avoiding and communication-reducing algorithms based on Message Passing Interface (MPI) non-blocking implementations. A performance evaluation using up to the full node count of the FX10 system indicates that (1) the MPI non-blocking implementation is 3x as efficient as the baseline implementation; (2) the hybrid MPI execution is 1.9x faster than the pure MPI execution; and (3) our proposed solver is 2.3x and 22x faster than a ScaLAPACK routine with optimized blocking size and with cyclic-cyclic distribution, respectively.
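To make the notion of a cache-resident matrix concrete, the following minimal sketch estimates the largest matrix order n for which dense n-by-n matrices fit in a cache of a given size. The function name `max_cache_resident_order`, the `copies` parameter, and the 12 MiB cache size are assumptions chosen for illustration and are not taken from the paper; the estimate also ignores workspace and any other data competing for the cache.

```python
import math


def max_cache_resident_order(cache_bytes: int,
                             bytes_per_element: int = 8,
                             copies: int = 1) -> int:
    """Largest n such that `copies` dense n-by-n matrices fit in `cache_bytes`.

    Hypothetical helper: solves copies * n^2 * bytes_per_element <= cache_bytes
    for the largest integer n. The default of 8 bytes assumes double precision.
    """
    if cache_bytes <= 0 or bytes_per_element <= 0 or copies <= 0:
        raise ValueError("all arguments must be positive")
    # Integer square root avoids floating-point rounding near the boundary.
    return math.isqrt(cache_bytes // (bytes_per_element * copies))


if __name__ == "__main__":
    cache = 12 * 1024 * 1024  # assumed 12 MiB cache, for illustration only
    print(max_cache_resident_order(cache))            # input matrix only
    print(max_cache_resident_order(cache, copies=2))  # matrix plus eigenvectors
```

Under these assumptions, a single double-precision matrix of order up to 1254 fits in the assumed cache, and the bound drops to 886 when a second matrix of the same size (for example, the eigenvectors) must be resident as well.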