Matrix decompositions are ubiquitous in machine learning, with applications in dimensionality reduction, data compression, and deep learning algorithms. Typical algorithms for computing matrix decompositions have polynomial complexity, which significantly increases their computational cost and running time. In this work, we leverage efficient operations that can be run in parallel on modern Graphics Processing Units (GPUs), the predominant computing architecture used, e.g., in deep learning, to reduce the computational burden of computing matrix decompositions. More specifically, we reformulate the randomized decomposition problem so that fast matrix multiplication operations (BLAS-3) serve as its building blocks. We show that this formulation, combined with fast random number generators, makes it possible to fully exploit the parallel processing power of GPUs. Our extensive evaluation confirms the superiority of this approach over the competing methods, and we release the results of this research as part of the official CUDA implementation (https://docs.nvidia.com/cuda/cusolver/index.html).
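To illustrate the idea of recasting a decomposition around BLAS-3 matrix multiplications, here is a minimal NumPy sketch of a randomized SVD in the standard Halko–Martinsson–Tropp style. This is an illustrative assumption, not the paper's exact algorithm or the cuSOLVER implementation: the function name, oversampling, and power-iteration parameters are choices made for this sketch. The point is that the dominant cost is dense GEMM (`A @ Omega`, `A.T @ Y`, `Q.T @ A`), which is exactly the kind of operation GPUs execute efficiently, while the expensive exact factorizations are confined to a small (k + p)-sized problem.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, seed=None):
    """Sketch of a randomized truncated SVD: the heavy steps are dense
    matrix multiplications (BLAS-3), which map well to GPU hardware."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Random Gaussian test matrix; on a GPU this would come from a fast
    # parallel random number generator.
    Omega = rng.standard_normal((n, k + oversample))
    Y = A @ Omega                       # GEMM: sample the range of A
    # Power iterations sharpen the captured spectrum; each step is GEMM-bound.
    for _ in range(n_iter):
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)              # small orthonormal basis for range(A)
    B = Q.T @ A                         # GEMM: project A to (k+p) x n
    # Exact SVD is only applied to the small projected matrix B.
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k]

# Usage: recover a matrix of exact rank 20 almost to machine precision.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 300))
U, s, Vt = randomized_svd(A, k=20, seed=0)
err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
```

Because `A` has exact rank 20 and we request k = 20 with oversampling, the sampled basis captures the full column space and the relative error `err` is at the level of floating-point round-off.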