Matrix sketching, which aims to approximate a matrix $\boldsymbol{A} \in \mathbb{R}^{N\times d}$ formed from a stream of $N$ row vectors with a much smaller sketch matrix $\boldsymbol{B} \in \mathbb{R}^{\ell\times d}, \ell \ll N$, has attracted increasing attention in fields such as large-scale data analytics and machine learning. A well-known deterministic matrix sketching method is the Frequent Directions algorithm, which achieves the optimal $O\left(\frac{d}{\varepsilon}\right)$ space bound while guaranteeing a covariance error of $\varepsilon = \lVert \boldsymbol{A}^\top \boldsymbol{A} - \boldsymbol{B}^\top \boldsymbol{B} \rVert_2/\lVert \boldsymbol{A} \rVert_F^2$. The matrix sketching problem becomes particularly interesting in the context of sliding windows, where the goal is to approximate the matrix $\boldsymbol{A}_W$ formed by the input vectors over the most recent $N$ time units. Despite recent efforts, however, whether the optimal $O\left(\frac{d}{\varepsilon}\right)$ space bound can be achieved on sliding windows has remained an open question. In this paper, we introduce the DS-FD algorithm, which achieves the optimal $O\left(\frac{d}{\varepsilon}\right)$ space bound for matrix sketching over row-normalized, sequence-based sliding windows. We also present matching upper and lower space bounds for time-based and unnormalized sliding windows, demonstrating the generality and optimality of DS-FD across sliding window models. This answers the open question regarding the optimal space bound for matrix sketching over sliding windows. Finally, extensive experiments on both synthetic and real-world datasets validate our theoretical claims and confirm the effectiveness of our algorithm in practice.
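To make the covariance error guarantee concrete, the following is a minimal sketch of the classical Frequent Directions algorithm on a full stream, written with NumPy. It is not the DS-FD algorithm and does not handle sliding windows. The function names `frequent_directions` and `covariance_error`, the doubled buffer of $2\ell$ rows, and the random test matrix are illustrative assumptions rather than details taken from the paper. For this buffered variant, the standard analysis gives a covariance error of at most $1/\ell$, so choosing $\ell = O(1/\varepsilon)$ rows of dimension $d$ corresponds to the $O\left(\frac{d}{\varepsilon}\right)$ space bound.

```python
import numpy as np


def _shrink(buf, ell):
    """Shrink the buffer so that at most `ell` rows remain nonzero."""
    # SVD of the buffer; s is sorted in descending order.
    _, s, vt = np.linalg.svd(buf, full_matrices=False)
    # Subtract the (ell+1)-th largest squared singular value, if it exists.
    delta = s[ell] ** 2 if len(s) > ell else 0.0
    s_new = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
    out = np.zeros_like(buf)
    out[: len(s)] = s_new[:, None] * vt
    return out


def frequent_directions(rows, ell):
    """Sketch a stream of d-dimensional rows into an (ell x d) matrix B.

    Illustrative implementation: uses a buffer of 2*ell rows and shrinks
    whenever the buffer is full.
    """
    rows = np.asarray(rows, dtype=float)
    d = rows.shape[1]
    buf = np.zeros((2 * ell, d))
    filled = 0
    for row in rows:
        if filled == 2 * ell:
            buf = _shrink(buf, ell)
            filled = ell  # rows ell..2*ell-1 are zero after shrinking
        buf[filled] = row
        filled += 1
    # Final shrink so the returned sketch has exactly ell rows.
    buf = _shrink(buf, ell)
    return buf[:ell]


def covariance_error(a, b):
    """Relative covariance error ||A^T A - B^T B||_2 / ||A||_F^2."""
    diff = a.T @ a - b.T @ b
    return np.linalg.norm(diff, 2) / np.linalg.norm(a, "fro") ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 20))  # synthetic stream, N=500, d=20
    ell = 10
    B = frequent_directions(A, ell)
    print("sketch shape:", B.shape)
    print("covariance error:", covariance_error(A, B), "bound:", 1 / ell)
```

The sketch stores only $2\ell$ rows at any time, regardless of the stream length $N$. Extending this idea to sliding windows requires additionally forgetting expired rows, which is the problem DS-FD addresses.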