Sketching algorithms have recently proven to be a powerful approach both for designing low-space streaming algorithms and for designing fast polynomial time approximation schemes (PTAS). In this work, we develop new techniques to extend the applicability of sketching-based approaches to the sparse dictionary learning and Euclidean $k$-means clustering problems. In particular, we initiate the study of the challenging setting where the dictionary/clustering assignment for each of the $n$ input points must be output, a setting that has received surprisingly little attention in prior work. On the fast algorithms front, we obtain a new approach for designing PTASs for the $k$-means clustering problem, which generalizes to the first PTAS for the sparse dictionary learning problem. On the streaming algorithms front, we obtain new upper and lower bounds for dictionary learning and $k$-means clustering. In particular, given a design matrix $\mathbf A\in\mathbb R^{n\times d}$ in a turnstile stream, we show an $\tilde O(nr/\epsilon^2 + dk/\epsilon)$ space upper bound for $r$-sparse dictionary learning of size $k$, an $\tilde O(n/\epsilon^2 + dk/\epsilon)$ space upper bound for $k$-means clustering, and an $\tilde O(n)$ space upper bound for $k$-means clustering on random order row insertion streams under a natural "bounded sensitivity" assumption. On the lower bounds side, we obtain a general $\tilde\Omega(n/\epsilon + dk/\epsilon)$ lower bound for $k$-means clustering, as well as an $\tilde\Omega(n/\epsilon^2)$ lower bound for algorithms that can estimate the cost of a single fixed set of candidate centers.
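To make the notion of estimating the cost of a fixed set of candidate centers from a sketch concrete, the following is a minimal illustration and not the algorithm of this work. It assumes a generic dense Gaussian (Johnson-Lindenstrauss style) sketch applied to the $d$ columns of $\mathbf A$; the function names, the sketch dimension `m`, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def kmeans_cost(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    # Pairwise squared distances, shape (n, k).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def sketched_kmeans_cost(points, centers, m, rng):
    """Estimate the k-means cost after projecting to m dimensions.

    Assumption: a dense Gaussian sketch S with i.i.d. N(0, 1/m) entries,
    used only as an illustration of the general sketching idea.
    """
    d = points.shape[1]
    S = rng.normal(size=(d, m)) / np.sqrt(m)
    return kmeans_cost(points @ S, centers @ S)

# Synthetic instance (hypothetical sizes): n points in d dimensions,
# drawn around k well-separated centers.
rng = np.random.default_rng(0)
n, d, k, m = 200, 500, 5, 200
true_centers = 5.0 * rng.normal(size=(k, d))
labels = rng.integers(k, size=n)
A = true_centers[labels] + rng.normal(size=(n, d))

exact = kmeans_cost(A, true_centers)
approx = sketched_kmeans_cost(A, true_centers, m, rng)
rel_err = abs(approx - exact) / exact
print(f"exact={exact:.1f} sketched={approx:.1f} rel_err={rel_err:.3f}")
```

Here the sketch reduces each point from `d` to `m` coordinates while keeping the cost for this one fixed set of centers close to the exact value. Note that this toy example still stores one sketched row per input point, so its space scales with $n$.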