In this paper, we consider feature screening for ultrahigh-dimensional clustering analysis. Based on the observation that the marginal distribution of any given feature is a mixture of its conditional distributions in the different clusters, we propose to screen clustering features by independently evaluating the homogeneity of each feature's mixture distribution. Important, cluster-relevant features have heterogeneous components in their mixture distributions, whereas unimportant features have homogeneous components. The well-known EM-test statistic is used to evaluate homogeneity. Under general parametric settings, we establish tail probability bounds for the EM-test statistic for homogeneous and heterogeneous features, and further show that the proposed screening procedure can achieve the sure independent screening property and even selection consistency. The limiting distribution of the EM-test statistic is also obtained for general parametric distributions. The proposed method is computationally efficient, accurately screens for important cluster-relevant features, and helps to significantly improve clustering, as demonstrated in our extensive simulations and real data analyses.
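To make the screening idea concrete, the following is a minimal sketch of marginal, feature-by-feature homogeneity screening. It is not the EM-test statistic used in the paper: as a simplifying assumption, each feature is scored with a plain likelihood-ratio statistic comparing a two-component normal mixture with shared variance against a single normal distribution. The function names `lr_statistic` and `screen_features`, the median-split initialisation, the iteration count, and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np


def lr_statistic(x, n_iter=100):
    """Likelihood-ratio statistic for one feature (illustrative stand-in,
    NOT the EM-test): two-component normal mixture with a shared variance
    versus a single normal. Assumes continuous data without heavy ties."""
    x = np.asarray(x, dtype=float)
    n = x.size

    # Homogeneous (null) fit: a single normal with MLE mean and variance.
    var0 = x.var()
    ll0 = -0.5 * n * (np.log(2.0 * np.pi * var0) + 1.0)

    # Heterogeneous fit: EM started from a median split (assumed choice).
    med = np.median(x)
    mu = np.array([x[x <= med].mean(), x[x > med].mean()])
    pi = np.array([0.5, 0.5])
    var = var0
    for _ in range(n_iter):
        # E-step: posterior component probabilities, computed stably.
        log_w = np.log(pi) - 0.5 * (x[:, None] - mu) ** 2 / var
        log_w -= log_w.max(axis=1, keepdims=True)
        resp = np.exp(log_w)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update proportions, means, and the shared variance.
        nk = resp.sum(axis=0) + 1e-12  # guard against an empty component
        pi = nk / nk.sum()
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum() / n

    # Mixture log-likelihood via a manual log-sum-exp.
    log_comp = (np.log(pi) - 0.5 * np.log(2.0 * np.pi * var)
                - 0.5 * (x[:, None] - mu) ** 2 / var)
    m = log_comp.max(axis=1, keepdims=True)
    ll1 = (m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))).sum()
    return max(0.0, 2.0 * (ll1 - ll0))


def screen_features(X, k):
    """Score every column independently and keep the k largest statistics."""
    stats = np.array([lr_statistic(X[:, j]) for j in range(X.shape[1])])
    top = np.argsort(stats)[::-1][:k]
    return top, stats


# Simulated example: 3 cluster-relevant features among 50 (assumed setup).
rng = np.random.default_rng(0)
n, p = 300, 50
labels = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, :3] += np.where(labels[:, None] == 1, 3.0, -3.0)  # heterogeneous means

selected, stats = screen_features(X, k=3)
```

Because every feature is scored on its own, the cost grows linearly in the number of features and the loop is trivially parallelisable, which is the practical appeal of marginal screening. In this sketch the cut-off `k` is fixed by hand, whereas a theoretically justified threshold would have to come from the tail bounds or limiting distribution of the actual test statistic.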