Feature selection in clustering is a hard task that requires simultaneously discovering relevant clusters and the variables relevant to those clusters. Whereas feature selection algorithms for clustering are often model-based, relying on optimised model selection or strong assumptions on $p(\pmb{x})$, we introduce Sparse GEMINI: a discriminative clustering model that maximises GEMINI, a geometry-aware generalisation of the mutual information, under a simple $\ell_1$ penalty. This algorithm avoids the burden of combinatorial feature subset exploration and scales easily to high-dimensional data and large numbers of samples, while requiring only the design of a clustering model $p_\theta(y|\pmb{x})$. We demonstrate the performance of Sparse GEMINI on synthetic datasets as well as large-scale datasets. Our results show that Sparse GEMINI is a competitive algorithm that can select subsets of variables relevant to the clustering without using relevance criteria or prior hypotheses.
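Schematically, and only as a sketch of the idea rather than the exact formulation, an $\ell_1$-penalised objective of this kind can be written as

$$\hat{\theta} = \arg\max_{\theta} \; \mathcal{I}(\pmb{x}; y) - \lambda \lVert \pmb{W} \rVert_1,$$

where the notation is assumed here for illustration: $\mathcal{I}(\pmb{x}; y)$ denotes the GEMINI evaluated under the clustering model $p_\theta(y|\pmb{x})$, $\lambda \geq 0$ is the penalty strength, and $\pmb{W}$ is the subset of the parameters $\theta$ that acts directly on the input variables. Under such a penalty, a variable whose associated weights are driven to zero no longer influences $p_\theta(y|\pmb{x})$ and can be regarded as unselected, which is how sparsity would stand in for an explicit search over feature subsets.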