In this paper, we propose a randomly projected convex clustering model for clustering a collection of $n$ high-dimensional data points in $\mathbb{R}^d$ with $K$ hidden clusters. Compared with the convex clustering model applied to the original $d$-dimensional data, we prove that, under some mild conditions, perfect recovery of the cluster membership assignments by the convex clustering model, if it exists, is preserved by the randomly projected convex clustering model with embedding dimension $m = O(\epsilon^{-2}\log(n))$, where $0 < \epsilon < 1$ is a given parameter. We further prove that the embedding dimension can be improved to $O(\epsilon^{-2}\log(K))$, which is independent of the number of data points. Extensive numerical experiments demonstrate the robustness and superior performance of the randomly projected convex clustering model, and also show that it can outperform the randomly projected K-means model in practice.
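To make the role of the embedding dimension concrete, the following minimal sketch illustrates only the projection step, not the convex clustering model itself. It assumes a Gaussian random projection matrix and an illustrative constant `c = 4.0` in $m = \lceil c\,\epsilon^{-2}\log(\cdot)\rceil$; neither choice is taken from the paper, and the function names `embedding_dim`, `random_projection`, and `max_pairwise_distortion` are hypothetical. Passing $n$ or $K$ as the count corresponds to the two bounds stated above.

```python
import math

import numpy as np


def embedding_dim(count, eps, c=4.0):
    """Return m = ceil(c * log(count) / eps^2).

    `count` may be n (number of points) or K (number of clusters).
    The constant c is an assumption chosen for illustration only.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    return math.ceil(c * math.log(count) / eps**2)


def random_projection(X, m, seed=0):
    """Project the rows of X from R^d to R^m with a Gaussian matrix.

    Entries are i.i.d. N(0, 1/m), so squared norms are preserved
    in expectation. This is one common choice, assumed here.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    P = rng.normal(loc=0.0, scale=1.0 / math.sqrt(m), size=(d, m))
    return X @ P


def max_pairwise_distortion(X, Y):
    """Largest relative error in squared pairwise distances after projection."""
    i, j = np.triu_indices(X.shape[0], k=1)
    d_orig = np.sum((X[i] - X[j]) ** 2, axis=1)
    d_proj = np.sum((Y[i] - Y[j]) ** 2, axis=1)
    return float(np.max(np.abs(d_proj / d_orig - 1.0)))


if __name__ == "__main__":
    n, d, eps = 50, 1000, 0.5
    X = np.random.default_rng(1).normal(size=(n, d))
    m = embedding_dim(n, eps)
    Y = random_projection(X, m)
    print("embedding dimension m:", m)
    print("max distortion:", max_pairwise_distortion(X, Y))
```

The projected points `Y` would then be passed to a convex clustering solver in place of `X`. The distortion printed here is only an empirical check on pairwise distances and is not a substitute for the recovery guarantees proved in the paper.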