We present a data structure to randomly sample rows from the Khatri-Rao product of several matrices according to the exact distribution of its leverage scores. Our proposed sampler draws each row in time logarithmic in the height of the Khatri-Rao product and quadratic in its column count, with persistent space overhead at most the size of the input matrices. As a result, it tractably draws samples even when the matrices forming the Khatri-Rao product have tens of millions of rows each. When used to sketch the linear least squares problems arising in CANDECOMP / PARAFAC tensor decomposition, our method achieves lower asymptotic complexity per solve than recent state-of-the-art methods. Experiments on billion-scale sparse tensors validate our claims, with our algorithm achieving higher accuracy than competing methods as the decomposition rank grows.
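For reference, the distribution being sampled can be written down directly for small inputs. The sketch below is a naive baseline, not the data structure described above: it materializes the full Khatri-Rao product, so its cost is linear in the product's height rather than logarithmic. The function names (`khatri_rao`, `leverage_scores`, `sample_rows`) are illustrative assumptions, and NumPy is assumed to be available. It relies on the standard identity that the Gram matrix of a Khatri-Rao product equals the elementwise (Hadamard) product of the factor Gram matrices.

```python
import numpy as np

def khatri_rao(mats):
    """Khatri-Rao product: each row is the elementwise product of one row from each factor."""
    out = mats[0]
    for m in mats[1:]:
        # (I, 1, R) * (1, J, R) -> (I, J, R), then flatten the two row indices
        out = (out[:, None, :] * m[None, :, :]).reshape(-1, m.shape[1])
    return out

def leverage_scores(mats):
    """Exact leverage scores of every row of the Khatri-Rao product (naive baseline)."""
    rank = mats[0].shape[1]
    # Gram matrix of the product = Hadamard product of the factor Gram matrices
    gram = np.ones((rank, rank))
    for m in mats:
        gram *= m.T @ m
    gram_pinv = np.linalg.pinv(gram)
    product = khatri_rao(mats)  # materialized in full: only viable for small inputs
    # score_i = a_i^T G^+ a_i for each row a_i of the product
    return np.einsum("ij,jk,ik->i", product, gram_pinv, product)

def sample_rows(mats, num_samples, rng):
    """Draw row indices with probability proportional to their leverage scores."""
    scores = leverage_scores(mats)
    return rng.choice(scores.size, size=num_samples, p=scores / scores.sum())
```

For a product with full column rank, the scores sum to the column count, which gives a quick sanity check. The appeal of the proposed sampler is that it draws from this same distribution without ever forming `product`.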