The peak performance of any SpMV (sparse matrix-vector multiplication) depends primarily on the available memory bandwidth and on how effectively it is used. GPUs, ASICs, and new FPGAs offer ever-higher bandwidth; however, for large-scale, highly sparse matrices, SpMV remains a hard problem because of its random access pattern and workload imbalance. Here, we show how to turn randomness to our advantage. We propose a matrix permutation pre-processing step that aims to maximize the entropy of the distribution of the nonzero elements. We seek any permutation that distributes the nonzero elements uniformly, thereby producing an SpMV problem that is amenable to workload balancing or to speeding up sorting algorithms. We conjecture that these permutations are most effective for matrices with no dense rows or columns and, as with preconditioning, when the matrix is reused. We shall show that entropy maximization is an optimization that any architecture may take advantage of, although in different ways. Most importantly, any developer can consider and deploy it. We shall present cases where we can improve performance by 15% on AMD-based (GPU-CPU) systems.
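To make the idea concrete, the following is a minimal sketch, not the implementation evaluated in this work. It assumes a matrix stored as a list of `(row, col, value)` triples, measures the entropy of the nonzero counts over a coarse `grid x grid` tiling (the tiling and the helper names `block_entropy`, `permute`, and `spmv` are illustrative assumptions), and uses uniformly random row and column permutations as one simple candidate for spreading the nonzeros. It also checks that the permuted product is the original product with its entries relabeled.

```python
import math
import random

def block_entropy(entries, n_rows, n_cols, grid):
    """Shannon entropy (bits) of nonzero counts over a grid x grid tiling."""
    counts = [0] * (grid * grid)
    for r, c, _ in entries:
        counts[(r * grid // n_rows) * grid + (c * grid // n_cols)] += 1
    total = len(entries)
    return -sum((k / total) * math.log2(k / total) for k in counts if k)

def permute(entries, row_perm, col_perm):
    """Relabel row r as row_perm[r] and column c as col_perm[c]."""
    return [(row_perm[r], col_perm[c], v) for r, c, v in entries]

def spmv(entries, x, n_rows):
    """Reference y = A x for a matrix given as (row, col, value) triples."""
    y = [0.0] * n_rows
    for r, c, v in entries:
        y[r] += v * x[c]
    return y

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n, grid = 64, 4

# Illustrative input (an assumption): a tridiagonal matrix, whose nonzeros
# cluster along the diagonal and therefore fill only a few tiles.
entries = [(i, j, 1.0 + rng.random())
           for i in range(n) for j in (i - 1, i, i + 1) if 0 <= j < n]

# Candidate permutations: uniformly random shuffles of rows and columns.
row_perm = list(range(n))
col_perm = list(range(n))
rng.shuffle(row_perm)
rng.shuffle(col_perm)
permuted = permute(entries, row_perm, col_perm)

h_before = block_entropy(entries, n, n, grid)
h_after = block_entropy(permuted, n, n, grid)
print("entropy before:", h_before, "after:", h_after,
      "upper bound:", math.log2(grid * grid))

# The permuted problem computes the same product, with entries relabeled.
x = [rng.random() for _ in range(n)]
x_perm = [0.0] * n
for c in range(n):
    x_perm[col_perm[c]] = x[c]
y = spmv(entries, x, n)
y_perm = spmv(permuted, x_perm, n)
```

Because the permutation only relabels rows and columns, its cost is paid once and can be amortized when the matrix is reused, and the original result is recovered by reading `y_perm[row_perm[r]]` in place of `y[r]`.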