We address the communication overhead of distributed sparse matrix-(multiple)-vector multiplication in the context of large-scale eigensolvers, using filter diagonalization as an example. The basis of our study is a performance model that includes a communication metric computed directly from the matrix sparsity pattern, without running any code. The performance model quantifies the extent to which scalability and parallel efficiency are lost due to communication overhead. To restore scalability, we identify two orthogonal layers of parallelism in the filter diagonalization technique. In the horizontal layer, the rows of the sparse matrix are distributed across individual processes. In the vertical layer, bundles of multiple vectors are distributed across separate process groups. An analysis in terms of the communication metric predicts that scalability can be restored if, and only if, the two orthogonal layers of parallelism are implemented via different distributed vector layouts. Our theoretical analysis is corroborated by benchmarks for application matrices from quantum and solid-state physics, road networks, and nonlinear programming. Finally, we demonstrate the benefits of using orthogonal layers of parallelism with two exemplary application cases (an exciton and a strongly correlated electron system), which incur either small or large communication overhead.
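As a rough illustration of how a communication measure can be read off a sparsity pattern without running the solver, the following Python sketch counts, for a contiguous row-block distribution, the distinct vector entries each process would have to fetch from other processes. It is a minimal sketch under stated assumptions, not the metric defined in the paper: the function names, the block distribution, and the "remote entries times vectors per group" volume estimate are all hypothetical choices made for this example.

```python
def block_owner(index, n, num_procs):
    """Owner of row (or vector entry) `index` under a contiguous block
    distribution of n rows over num_procs processes (assumed layout)."""
    block = -(-n // num_procs)  # ceil(n / num_procs)
    return index // block


def remote_entries(pattern, n, num_procs):
    """For each process, count the distinct vector entries it needs but
    does not own. `pattern` is an iterable of (row, col) nonzero positions."""
    needed = [set() for _ in range(num_procs)]
    for row, col in pattern:
        p = block_owner(row, n, num_procs)
        if block_owner(col, n, num_procs) != p:
            needed[p].add(col)
    return [len(s) for s in needed]


def volume_per_process(pattern, n, total_procs, groups, num_vectors):
    """Hypothetical volume estimate when `total_procs` processes are split
    into `groups` process groups: rows are distributed inside each group
    (horizontal layer), vectors are distributed across groups (vertical layer)."""
    if total_procs % groups or num_vectors % groups:
        raise ValueError("groups must divide total_procs and num_vectors")
    procs_per_group = total_procs // groups
    vectors_per_group = num_vectors // groups
    worst = max(remote_entries(pattern, n, procs_per_group))
    return worst * vectors_per_group


def tridiagonal_pattern(n):
    """Sparsity pattern of an n x n tridiagonal matrix (toy example)."""
    return [(i, j) for i in range(n) for j in (i - 1, i, i + 1) if 0 <= j < n]


if __name__ == "__main__":
    pattern = tridiagonal_pattern(8)
    print(remote_entries(pattern, 8, 4))
    for g in (1, 2, 4):
        print(g, volume_per_process(pattern, 8, total_procs=4, groups=g, num_vectors=4))
```

For the 8 x 8 tridiagonal toy pattern on four processes, the per-process counts are `[1, 2, 2, 1]`, and the estimated per-process volume for four vectors drops from 8 to 2 to 0 as the processes are arranged into 1, 2, and 4 groups. This toy case only shows the bookkeeping. Whether and how much a real matrix benefits depends on its sparsity pattern, which is the question the performance model addresses.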