Unsupervised learning has gained prominence in the big data era, offering a means to extract valuable insights from unlabeled datasets. Deep clustering has emerged as an important category of unsupervised learning that exploits the non-linear mapping capabilities of neural networks to enhance clustering performance. Most of the deep clustering literature focuses on minimizing the intra-cluster variability in some embedded space while keeping the learned representation consistent with the original high-dimensional dataset. In this work, we propose soft silhouette, a probabilistic formulation of the silhouette coefficient. Like the conventional silhouette coefficient, soft silhouette rewards compact and distinctly separated clustering solutions. When optimized within a deep clustering framework, soft silhouette guides the learned representations towards forming compact and well-separated clusters. In addition, we introduce an autoencoder-based deep learning architecture that is suitable for optimizing the soft silhouette objective function. The proposed deep clustering method has been tested and compared with several well-studied deep clustering methods on various benchmark datasets, yielding very satisfactory clustering results.
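To make the idea of a probabilistic silhouette concrete, the following is a minimal sketch of how such a score might be computed. It is an illustration under stated assumptions and not necessarily the exact formulation proposed in this work: the abstract does not give the formula, so the weighting scheme, the function name `soft_silhouette`, the `eps` parameter, and the toy data are all hypothetical. The sketch assumes each point carries a vector of cluster membership probabilities, replaces the usual mean distances with probability-weighted means, and averages the per-cluster silhouette values of each point by its membership probabilities.

```python
import math

def soft_silhouette(points, probs, eps=1e-12):
    """Illustrative soft silhouette (assumed formulation, not from the source).

    points: list of n coordinate tuples.
    probs:  list of n lists, each holding k >= 2 membership probabilities.
    Assumes every cluster has non-zero total membership mass.
    """
    n = len(points)
    k = len(probs[0])
    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    total = 0.0
    for i in range(n):
        # a[c]: probability-weighted mean distance from point i to
        # cluster c, excluding point i itself.
        a = []
        for c in range(k):
            num = sum(probs[j][c] * dist[i][j] for j in range(n) if j != i)
            den = sum(probs[j][c] for j in range(n) if j != i)
            a.append(num / (den + eps))

        score_i = 0.0
        for c in range(k):
            # b: weighted mean distance to the nearest other cluster.
            b = min(a[m] for m in range(k) if m != c)
            s_c = (b - a[c]) / (max(a[c], b) + eps)
            # Weight the silhouette "as if in cluster c" by P(c | point i).
            score_i += probs[i][c] * s_c
        total += score_i
    return total / n


# Toy data: two tight, well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
hard = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3   # confident, correct assignment
uniform = [[0.5, 0.5]] * 6                   # completely uncertain assignment

good_score = soft_silhouette(points, hard)        # close to 1
flat_score = soft_silhouette(points, uniform)     # evaluates to 0
```

With one-hot probabilities this sketch reduces to the conventional silhouette coefficient, while fully uncertain assignments score zero. Because the score is built from differentiable operations on the membership probabilities (apart from the `min` and `max`, which are subdifferentiable), an objective of this kind could in principle be maximized by gradient-based training, which is the setting the abstract describes.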