The importance of variable selection for clustering has long been recognized, and mixture models are well established as a statistical approach to clustering. Yet the literature on variable selection in model-based clustering remains largely rooted in the assumption of Gaussian clusters. Unsurprisingly, variable selection algorithms based on this assumption tend to break down in the presence of cluster skewness. A novel variable selection algorithm is presented that uses the Manly transformation mixture model to select variables based on their ability to separate clusters, and that remains effective even when clusters depart from the Gaussian assumption. The proposed approach, which is implemented within the R package vscc, is compared with existing variable selection methods, including one that can account for cluster skewness, using simulated and real datasets.
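For readers unfamiliar with the Manly transformation named in the abstract, the following is a minimal sketch of the standard exponential (Manly) transformation and its inverse, which maps a skewed variable toward symmetry before a Gaussian component is fitted. It illustrates the transformation only, not the variable selection algorithm itself. The function names `manly_transform` and `manly_inverse` are hypothetical and are not part of the vscc package; Python is used purely for illustration, whereas the package itself is written in R.

```python
import numpy as np


def manly_transform(x, lam):
    """Apply the Manly (exponential) transformation elementwise.

    Returns (exp(lam * x) - 1) / lam when lam != 0, and x unchanged
    when lam == 0 (the limiting case as lam tends to 0).
    """
    x = np.asarray(x, dtype=float)
    if lam == 0.0:
        return x.copy()
    # expm1(z) computes exp(z) - 1 accurately for small z
    return np.expm1(lam * x) / lam


def manly_inverse(y, lam):
    """Invert the Manly transformation (requires 1 + lam * y > 0)."""
    y = np.asarray(y, dtype=float)
    if lam == 0.0:
        return y.copy()
    # log1p(z) computes log(1 + z) accurately for small z
    return np.log1p(lam * y) / lam


# Illustrative values only; lam = 0.5 is an arbitrary choice.
x = np.array([-1.0, 0.0, 2.0])
y = manly_transform(x, 0.5)
x_back = manly_inverse(y, 0.5)
```

In a Manly mixture, each component and each variable would typically carry its own transformation parameter, with a zero parameter leaving that variable untransformed; the sketch above takes a single scalar `lam` for simplicity.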