We consider the problem of simultaneously clustering and learning a linear representation of data lying close to a union of low-dimensional manifolds, a fundamental task in machine learning and computer vision. When the manifolds are assumed to be linear subspaces, this reduces to the classical problem of subspace clustering, which has been studied extensively over the past two decades. Unfortunately, many real-world datasets, such as natural images, cannot be well approximated by linear subspaces. To address this, numerous works have attempted to learn a transformation of the data that maps a union of general non-linear manifolds to a union of linear subspaces, with points from the same manifold mapped to the same subspace. However, many existing works have limitations: they assume that the cluster membership of the samples is known, require high sampling density, or have been shown theoretically to learn trivial representations. In this paper, we propose to optimize the Maximal Coding Rate Reduction metric with respect to both the data representation and a novel doubly stochastic cluster membership, inspired by state-of-the-art subspace clustering results. We give a parameterization of such a representation and membership that allows efficient mini-batching and one-shot initialization. Experiments on the CIFAR-10, CIFAR-20, CIFAR-100, and TinyImageNet-200 datasets show that the proposed method is much more accurate and scalable than state-of-the-art deep clustering methods, and that it further learns a latent linear representation of the data.
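To make the two ingredients named above concrete, the following is a minimal, dependency-free sketch, not the authors' implementation. It assumes the standard form of the coding rate reduction, R(Z) - sum_j (tr(Pi_j)/n) R(Z | Pi_j) with R = 1/2 logdet(I + d/(m eps^2) Z diag(w) Z^T), and uses plain Sinkhorn row/column normalization as one common way to obtain a doubly stochastic matrix. The abstract does not specify how the paper parameterizes the membership or combines it with the objective, so the function names (`logdet_spd`, `coding_rate`, `rate_reduction`, `sinkhorn`), the value `eps=0.5`, and the toy data are all illustrative assumptions.

```python
import math


def logdet_spd(A):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return 2.0 * sum(math.log(L[i][i]) for i in range(n))


def coding_rate(Z, w, eps=0.5):
    """1/2 logdet(I + d/(m eps^2) * sum_i w_i z_i z_i^T), with m = sum(w).

    Z is a list of n samples, each a list of d features (assumed layout).
    """
    d = len(Z[0])
    scale = d / (sum(w) * eps ** 2)
    A = [[(1.0 if a == b else 0.0)
          + scale * sum(wi * z[a] * z[b] for z, wi in zip(Z, w))
          for b in range(d)] for a in range(d)]
    return 0.5 * logdet_spd(A)


def rate_reduction(Z, Pi, eps=0.5):
    """Standard rate reduction; Pi[j][i] is the soft weight of sample i in group j."""
    n = len(Z)
    whole = coding_rate(Z, [1.0] * n, eps)
    parts = sum(sum(w) / n * coding_rate(Z, w, eps) for w in Pi if sum(w) > 0)
    return whole - parts


def sinkhorn(K, iters=200):
    """Alternately normalize rows and columns of a strictly positive square matrix.

    The result is approximately doubly stochastic (rows and columns sum to 1).
    """
    P = [row[:] for row in K]
    n = len(P)
    for _ in range(iters):
        for row in P:
            s = sum(row)
            for j in range(n):
                row[j] /= s
        for j in range(n):
            s = sum(P[i][j] for i in range(n))
            for i in range(n):
                P[i][j] /= s
    return P


# Toy data: two samples on each of two orthogonal 1-D subspaces of R^2.
Z = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
aligned = [[1, 1, 0, 0], [0, 0, 1, 1]]  # groups match the subspaces
mixed = [[1, 0, 1, 0], [0, 1, 0, 1]]    # groups cut across the subspaces

score_aligned = rate_reduction(Z, aligned)
score_mixed = rate_reduction(Z, mixed)
P = sinkhorn([[1.0, 2.0], [3.0, 4.0]])
```

On this toy input the aligned grouping scores ln(5/3), roughly 0.51, while the mixed grouping scores 0, which illustrates why maximizing the rate reduction favors memberships that agree with the underlying subspaces. A practical version would use a numerical library with automatic differentiation and a learned feature map rather than fixed lists.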