A classical result of Johnson and Lindenstrauss states that a set of $n$ high-dimensional data points can be projected down to $O(\log n/\epsilon^2)$ dimensions such that their squared pairwise distances are preserved up to a small distortion $\epsilon\in(0,1)$. The JL lemma has been proved optimal in the general case, so improvements can only be sought for special cases. This work aims to improve the $\epsilon^{-2}$ dependence using techniques inspired by the Hutch++ algorithm, which reduces $\epsilon^{-2}$ to $\epsilon^{-1}$ for the related problem of implicit matrix trace estimation. We first present an algorithm to estimate the Euclidean lengths of the rows of a matrix. We prove element-wise probabilistic bounds for it that are at least as good as standard JL approximations in the worst case, and asymptotically better for matrices with a decaying spectrum. Moreover, for any matrix, regardless of its spectrum, the algorithm achieves $\epsilon$-accuracy in the total, Frobenius norm-wise relative error using only $O(\epsilon^{-1})$ queries. This is a quadratic improvement over the norm-wise error of standard JL approximations. We also show how these results can be extended to estimate (i) the Euclidean distances between data points and (ii) the statistical leverage scores of tall-and-skinny data matrices, which are ubiquitous in many applications, with analogous theoretical improvements. Proof-of-concept numerical experiments are presented to validate the theoretical analysis.
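To make the row-length estimation idea concrete, the following is a minimal NumPy sketch. It is not the algorithm analyzed in this work, whose details are not given in the abstract. It only illustrates one plausible Hutch++-style construction under stated assumptions: squared row norms are computed exactly on a sketched subspace with orthonormal basis $Q$, and a standard Gaussian JL estimate is applied only to the residual $A(I - QQ^\top)$. The function names, the equal three-way split of the query budget, and the use of Gaussian test matrices are all illustrative assumptions.

```python
import numpy as np


def jl_row_norms_sq(A, m, rng):
    """Baseline: standard Gaussian JL estimate of the squared row norms of A.

    Uses m matrix-vector products with A.
    """
    d = A.shape[1]
    G = rng.standard_normal((d, m)) / np.sqrt(m)
    return np.sum((A @ G) ** 2, axis=1)


def deflated_row_norms_sq(A, m, rng):
    """Hypothetical Hutch++-style estimator (an assumption, not the paper's method).

    Splits the budget of m queries into three parts of size k = m // 3:
      1. sketch the row space of A via A^T S,
      2. compute ||a_i Q||^2 exactly on that subspace,
      3. apply JL to the residual A (I - Q Q^T).
    """
    n, d = A.shape
    k = max(m // 3, 1)

    # Orthonormal basis Q (d x k) for an approximate dominant row space of A.
    S = rng.standard_normal((n, k))
    Q, _ = np.linalg.qr(A.T @ S)

    # Exact contribution of the captured subspace.
    AQ = A @ Q
    captured = np.sum(AQ ** 2, axis=1)

    # JL estimate of the residual, using A (I - Q Q^T) G = A G - (A Q)(Q^T G).
    G = rng.standard_normal((d, k)) / np.sqrt(k)
    R = A @ G - AQ @ (Q.T @ G)
    residual = np.sum(R ** 2, axis=1)

    # By Pythagoras, ||a_i||^2 = ||a_i Q||^2 + ||a_i (I - Q Q^T)||^2.
    return captured + residual
```

In this sketch, a matrix whose rank is at most `k` is recovered essentially exactly, since the residual vanishes up to rounding, which gives some intuition for why a decaying spectrum helps. For example, `deflated_row_norms_sq(A, 30, np.random.default_rng(0))` can be compared against `np.sum(A**2, axis=1)` and against `jl_row_norms_sq` with the same budget.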