In this paper, we consider feature screening for ultrahigh-dimensional cluster analysis. Because the marginal distribution of any given feature is a mixture of its conditional distributions in the different clusters, we propose to screen clustering features by independently evaluating the homogeneity of each feature's mixture distribution. Important, cluster-relevant features have heterogeneous components in their mixture distributions, whereas unimportant features have homogeneous components. We use the well-known EM-test statistic to evaluate homogeneity. Under general parametric settings, we establish tail probability bounds for the EM-test statistic for both homogeneous and heterogeneous features, and further show that the proposed screening procedure achieves the sure independent screening property and even selection consistency. We also obtain the limiting distribution of the EM-test statistic for general parametric distributions. The proposed method is computationally efficient, accurately screens for important cluster-relevant features, and helps to significantly improve clustering, as demonstrated in our extensive simulations and real data analyses.
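The screening idea described above, scoring each feature separately by how strongly its marginal distribution departs from a single homogeneous component, can be illustrated with a minimal sketch. This is not the EM-test itself. As a simplifying assumption, it swaps in a plain likelihood-ratio statistic that compares a two-component Gaussian location mixture with common variance against a single Gaussian, fitted by ordinary EM. The function names `homogeneity_stat` and `screen_features`, the Gaussian model, the simulated data, and the choice to keep the top three features are all hypothetical and chosen only for illustration.

```python
import numpy as np


def _normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2), evaluated elementwise."""
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2


def _mixture_loglik_terms(x, w, mu, sigma):
    """Per-component and per-observation log-likelihood terms of the mixture."""
    log_comp = np.log(w) + _normal_logpdf(x[:, None], mu, sigma)  # shape (n, 2)
    log_mix = np.logaddexp(log_comp[:, 0], log_comp[:, 1])        # shape (n,)
    return log_comp, log_mix


def homogeneity_stat(x, n_iter=100):
    """Simplified proxy for a homogeneity test on one feature (assumption:
    NOT the EM-test): 2 * (loglik of 2-component mixture - loglik of 1 Gaussian)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    sigma0 = x.std()
    if n < 4 or sigma0 == 0.0:
        return 0.0
    loglik_null = _normal_logpdf(x, x.mean(), sigma0).sum()

    # Initialise the two component means from the lower and upper halves.
    med = np.median(x)
    lower, upper = x[x <= med], x[x > med]
    if upper.size == 0:
        return 0.0
    w = np.array([0.5, 0.5])
    mu = np.array([lower.mean(), upper.mean()])
    sigma = sigma0

    for _ in range(n_iter):
        # E-step: posterior component probabilities for each observation.
        log_comp, log_mix = _mixture_loglik_terms(x, w, mu, sigma)
        resp = np.exp(log_comp - log_mix[:, None])
        # M-step: update weights, means, and the common standard deviation.
        nk = np.maximum(resp.sum(axis=0), 1e-12)
        w = np.clip(nk / n, 1e-8, 1.0)
        w = w / w.sum()
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum() / n
        sigma = max(np.sqrt(var), 1e-6 * sigma0)

    _, log_mix = _mixture_loglik_terms(x, w, mu, sigma)
    return max(0.0, 2.0 * (log_mix.sum() - loglik_null))


def screen_features(X, top_k):
    """Score every column independently and return the top_k column indices."""
    stats = np.array([homogeneity_stat(X[:, j]) for j in range(X.shape[1])])
    selected = np.argsort(stats)[::-1][:top_k]
    return selected, stats


# Simulated example: 200 samples, 50 features, only the first 3 carry
# cluster information (means shifted by -2.5 or +2.5 depending on the cluster).
rng = np.random.default_rng(0)
n, p = 200, 50
labels = rng.integers(0, 2, size=n)
X = rng.standard_normal((n, p))
X[:, :3] += np.where(labels[:, None] == 1, 2.5, -2.5)

selected, stats = screen_features(X, top_k=3)
```

Because each feature is scored on its own, the cost grows linearly with the number of features, and the retained columns can then be passed to any clustering algorithm. In practice the cutoff would come from a threshold on the statistic rather than a fixed `top_k`; the fixed value here is only a convenience for the simulated example.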