Subspace clustering methods that adopt a self-expressive model, in which each data point is represented as a linear combination of the other data points in the dataset, are powerful unsupervised learning techniques. On large datasets, however, representing each data point against a dictionary of all other data points incurs high computational complexity. To alleviate this issue, we introduce a parallelizable multi-subset based self-expressive model (PMS), which represents each data point by combining multiple subsets, each consisting of only a small proportion of the samples. Adopting PMS in subspace clustering (PMSSC) yields computational advantages because the optimization problems decomposed over the subsets are small and can be solved efficiently in parallel. Furthermore, PMSSC combines the multiple self-expressive coefficient vectors obtained from the subsets, which improves self-expressiveness. Extensive experiments on synthetic and real-world datasets show the efficiency and effectiveness of our approach in comparison to other methods.
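The multi-subset idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how subsets are drawn, which regularizer is used, or how the per-subset coefficient vectors are combined. The sketch assumes uniformly random subsets, a ridge (squared L2) penalty as a simple stand-in for whatever penalty PMSSC uses, and plain averaging as the combination rule. The names `subset_coefficients`, `pms_representation`, `n_subsets`, `subset_size`, and `lam` are hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def subset_coefficients(X, j, subset, lam):
    """Ridge self-expression of column j of X over the columns in `subset`.

    Assumption: a ridge penalty stands in for the (unspecified) regularizer.
    """
    D = X[:, subset]                           # small dictionary, shape (d, m)
    G = D.T @ D + lam * np.eye(len(subset))    # m x m system instead of N x N
    c_small = np.linalg.solve(G, D.T @ X[:, j])
    c = np.zeros(X.shape[1])
    c[subset] = c_small                        # embed into a length-N vector
    return c


def pms_representation(X, j, n_subsets=4, subset_size=20, lam=0.1, seed=0):
    """Combine coefficient vectors from several random subsets by averaging.

    Assumptions: random subsets without replacement, simple mean as combiner.
    """
    rng = np.random.default_rng(seed)
    others = np.delete(np.arange(X.shape[1]), j)   # point j never represents itself
    subsets = [rng.choice(others, size=subset_size, replace=False)
               for _ in range(n_subsets)]
    # The subset problems are independent, so they can be solved in parallel.
    with ThreadPoolExecutor() as pool:
        coeffs = list(pool.map(lambda s: subset_coefficients(X, j, s, lam), subsets))
    return np.mean(coeffs, axis=0)


# Toy data: 60 points drawn from a 3-dimensional subspace of R^10.
rng = np.random.default_rng(1)
X = rng.standard_normal((10, 3)) @ rng.standard_normal((3, 60))
c = pms_representation(X, j=0)
```

Each subset problem involves only a `subset_size` by `subset_size` linear system, which is where the computational saving comes from in this sketch; the averaged vector `c` has nonzero entries only on the union of the sampled subsets. In a full clustering pipeline the coefficient vectors for all points would typically be assembled into an affinity matrix and passed to spectral clustering, a step omitted here.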