This paper studies a factor-model-based approach for clustering high-dimensional data generated from a mixture of strongly correlated variables. Statistical modeling with correlated structures pervades modern applications in economics, finance, genomics, and wireless sensing, among others, and factor modeling is a popular technique for explaining the common dependence. Standard techniques for clustering high-dimensional data, e.g., naive spectral clustering, often fail to yield insightful results because their performance depends heavily on the mixture components being weakly correlated. To address the clustering problem in the presence of a latent factor model, we propose the Factor Adjusted Spectral Clustering (FASC) algorithm, which adds a data denoising step that removes the factor component to cope with the data dependency. We prove that this method achieves a mislabeling rate that is exponentially low in the signal-to-noise ratio under a general set of assumptions. These assumptions bridge many classical factor models in the literature, such as the pervasive factor model, the weak factor model, and the sparse factor model. The FASC algorithm is also computationally efficient, requiring only near-linear sample complexity with respect to the data dimension. We also demonstrate the applicability of the FASC algorithm through real data experiments and numerical studies, and show that FASC yields meaningful results in many cases where traditional spectral clustering fails.
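The two-stage idea described in the abstract (remove the factor component, then apply spectral clustering to what remains) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that the factor component is estimated by the leading principal components of the centered data, that the number of factors is known, and that a plain k-means step is run on the leading singular vectors of the residual. The names `fasc_sketch`, `kmeans`, and `simulate`, as well as all simulation parameters, are hypothetical and chosen only for illustration.

```python
import numpy as np


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm with farthest-first initialization."""
    # Start from the point with the largest norm, then add far-away points.
    centers = [points[np.argmax((points ** 2).sum(axis=1))]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def fasc_sketch(data, n_factors, n_clusters):
    """Factor-adjusted spectral clustering, sketched under the stated assumptions."""
    centered = data - data.mean(axis=0)
    # Step 1 (assumed): estimate the factor component with the top
    # n_factors principal components and subtract it.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    factor_part = (u[:, :n_factors] * s[:n_factors]) @ vt[:n_factors]
    residual = centered - factor_part
    # Step 2: spectral clustering on the residual, i.e. k-means on the
    # leading left singular vectors scaled by their singular values.
    u2, s2, _ = np.linalg.svd(residual, full_matrices=False)
    embedding = u2[:, :n_clusters] * s2[:n_clusters]
    return kmeans(embedding, n_clusters)


def simulate(n=200, p=50, r=2, seed=0):
    """Two-component mixture plus a strong latent factor component (toy data)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    mu = 0.6 * rng.choice([-1.0, 1.0], size=p)   # cluster centers are +mu / -mu
    loadings = 3.0 * rng.normal(size=(p, r))     # strong, dense loadings
    factors = rng.normal(size=(n, r))
    noise = rng.normal(size=(n, p))
    signs = 2.0 * labels - 1.0
    data = signs[:, None] * mu[None, :] + factors @ loadings.T + noise
    return data, labels


data, truth = simulate()
pred = fasc_sketch(data, n_factors=2, n_clusters=2)
# Accuracy up to a swap of the two cluster labels.
accuracy = max(np.mean(pred == truth), np.mean(pred != truth))
print(f"clustering accuracy: {accuracy:.3f}")
```

In this toy setting the factor directions dominate the leading singular vectors of the raw data, which is why the sketch removes them before computing the spectral embedding. How the factor component should be estimated under weak or sparse factor models is not specified in the abstract and is left open here.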