Deep clustering (DC) is often claimed to hold a key advantage over $k$-means clustering. Yet this advantage is usually demonstrated on image datasets only, and it is unclear whether it addresses the fundamental limitations of $k$-means clustering. Deep Embedded Clustering (DEC) learns a latent representation via an autoencoder and performs clustering with a $k$-means-like procedure, with the optimization conducted in an end-to-end manner. This paper investigates whether the deep-learned representation enables DEC to overcome the known fundamental limitations of $k$-means clustering, i.e., its inability to discover clusters of arbitrary shapes, varied sizes, and varied densities. Our investigation of DEC has wider implications for deep clustering methods in general. Notably, none of these methods exploits the underlying data distribution. We show that a non-deep-learning approach achieves the intended aim of deep clustering by exploiting the distributional information of clusters in a dataset to effectively address these fundamental limitations.