Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed. Although faster and more versatile than offline methods, online clustering can easily reach the collapsed solution, in which the encoder maps all inputs to the same point and all of them are placed in a single cluster. Successful existing models employ various techniques to avoid this problem, most of which either require data augmentation or aim to make the average soft assignment across the dataset the same for each cluster. We propose a method that does not require data augmentation and that, unlike existing methods, regularizes the hard assignments. Using a Bayesian framework, we derive an intuitive optimization objective that can be straightforwardly included in the training of the encoder network. In tests on four image datasets, our method consistently avoids collapse more robustly than other methods and leads to more accurate clustering. We also conduct further experiments and analyses that justify our choice to regularize the hard cluster assignments.
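To make the distinction between soft and hard assignments concrete, the following is a minimal diagnostic sketch, not the objective derived in the paper. It assumes a generic soft assignment (a softmax over negative squared distances to centroids), and every name in it (`soft_assignments`, `assignment_stats`, `temperature`, the toy centroids and data) is hypothetical. It computes the two statistics mentioned above, the average soft assignment and the frequency of hard assignments, for a well-spread embedding and for a collapsed one.

```python
import numpy as np

def soft_assignments(z, centroids, temperature=1.0):
    """Softmax over negative squared distances (an assumed, generic choice)."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    return float(-(p * np.log(p + eps)).sum())

def assignment_stats(z, centroids):
    """Return (mean soft assignment, hard-assignment frequencies)."""
    p = soft_assignments(z, centroids)
    k = centroids.shape[0]
    mean_soft = p.mean(axis=0)
    hard = p.argmax(axis=1)  # hard label = most probable cluster
    hard_freq = np.bincount(hard, minlength=k) / len(hard)
    return mean_soft, hard_freq

rng = np.random.default_rng(0)
centroids = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# Toy "healthy" embedding: 50 points scattered tightly around each centroid.
healthy = np.concatenate(
    [c + 0.1 * rng.standard_normal((50, 2)) for c in centroids]
)
# Toy "collapsed" embedding: the encoder maps every input to the same point.
collapsed = np.tile(np.array([[0.0, 0.2]]), (150, 1))

healthy_soft, healthy_freq = assignment_stats(healthy, centroids)
collapsed_soft, collapsed_freq = assignment_stats(collapsed, centroids)

print("healthy   hard-assignment entropy:", entropy(healthy_freq))
print("collapsed hard-assignment entropy:", entropy(collapsed_freq))
print("collapsed mean soft assignment:   ", collapsed_soft)
```

In this toy setting the hard-assignment frequencies of the collapsed embedding place all mass on one cluster, so their entropy is essentially zero, whereas the healthy embedding spreads the hard labels evenly. A quantity of this kind could serve as a collapse monitor during training; how the paper turns the idea into a Bayesian training objective is not specified in the abstract and is not reproduced here.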