Research on unsupervised variable selection lags far behind research on supervised variable selection. A forward partial-variable clustering full-variable loss (FPCFL) method is proposed to address the corresponding challenges. One advantage is that FPCFL can distinguish active, redundant, and uninformative variables, which previous methods cannot do. Theoretical and simulation studies show that a clustering method that uses all the variables can perform worse when many uninformative variables are involved, and that better results can be expected once the uninformative variables are excluded. The research addresses an earlier concern about how variable selection affects clustering performance. Whereas many previous methods attempt to select all the relevant variables, the proposed method selects a subset that can induce an equally good result. This phenomenon does not arise in supervised variable selection problems.
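
To make the idea of "partial-variable clustering with a full-variable loss" concrete, the following is a minimal illustrative sketch and not the authors' algorithm. It assumes k-means with two clusters as the clustering step, the within-cluster sum of squares over all variables as the loss, a relative-improvement stopping rule, and a synthetic four-variable data set; the names `kmeans2`, `full_variable_loss`, `forward_select`, and `make_data` are all hypothetical.

```python
import random


def make_data(n=60, seed=0):
    """Toy data (an assumption, not from the paper): one active, one redundant, two noise variables."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        center = -5.0 if i < n // 2 else 5.0
        x0 = center + rng.gauss(0, 0.5)  # active: carries the cluster structure
        x1 = x0 + rng.gauss(0, 0.1)      # redundant: near-copy of x0
        x2 = rng.gauss(0, 1)             # uninformative noise
        x3 = rng.gauss(0, 1)             # uninformative noise
        data.append([x0, x1, x2, x3])
    return data


def kmeans2(points, n_iter=50):
    """Plain Lloyd's algorithm with k = 2 and a deterministic farthest-point start."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    first = points[0]
    second = max(points, key=lambda p: d2(p, first))
    centers = [list(first), list(second)]
    labels = None
    for _ in range(n_iter):
        new_labels = [0 if d2(p, centers[0]) <= d2(p, centers[1]) else 1
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster empties out
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def full_variable_loss(data, labels):
    """Within-cluster sum of squares computed over ALL variables (assumed loss)."""
    loss = 0.0
    for k in set(labels):
        members = [row for row, lab in zip(data, labels) if lab == k]
        for col in zip(*members):
            mean = sum(col) / len(col)
            loss += sum((x - mean) ** 2 for x in col)
    return loss


def forward_select(data, tol=1e-6):
    """Greedy forward search: cluster on a variable subset, score on all variables."""
    n_vars = len(data[0])
    selected = []
    current = full_variable_loss(data, [0] * len(data))  # baseline: one big cluster
    while len(selected) < n_vars:
        best_var, best_loss = None, current
        for j in range(n_vars):
            if j in selected:
                continue
            cols = selected + [j]
            partial = [[row[c] for c in cols] for row in data]  # partial-variable view
            loss = full_variable_loss(data, kmeans2(partial))
            if loss < best_loss:
                best_var, best_loss = j, loss
        # Stop when no candidate lowers the loss by more than a relative tol (assumed rule).
        if best_var is None or best_loss > current * (1 - tol):
            break
        selected.append(best_var)
        current = best_loss
    return selected


if __name__ == "__main__":
    print(forward_select(make_data()))
```

On this toy data the search should stop after a single variable: clustering on either the active variable or its near-copy already recovers the two groups, so adding the other one, or a noise variable, does not lower the full-variable loss. This mirrors the claim that a subset of the relevant variables can induce an equally good result.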