Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed. While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution, in which the encoder maps all inputs to the same point and every input is placed in a single cluster. Successful existing models have employed various techniques to avoid this problem, most of which either require data augmentation or aim to make the average soft assignment across the dataset the same for each cluster. We propose a method that does not require data augmentation and that, unlike existing methods, regularizes the hard assignments. Using a Bayesian framework, we derive an intuitive optimization objective that can be straightforwardly included in the training of the encoder network. Tested on four image datasets and one human-activity recognition dataset, our method consistently avoids collapse more robustly than other methods and leads to more accurate clustering. We also conduct further experiments and analyses justifying our choice to regularize the hard cluster assignments. Code is available at https://github.com/Lou1sM/online_hard_clustering.
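To make the distinction between soft and hard assignments concrete, the following is a minimal illustrative sketch and not the objective derived in the paper. It assumes soft assignments are given as a matrix of per-point cluster probabilities; the helper names `assignment_stats` and `entropy` are hypothetical. It shows a case where the average soft assignment is almost uniform across clusters even though the hard (argmax) assignments have fully collapsed into one cluster.

```python
import numpy as np


def assignment_stats(soft):
    """Return (mean soft assignment, fraction of hard assignments) per cluster.

    soft: array of shape (n_points, n_clusters), each row summing to 1.
    """
    n_clusters = soft.shape[1]
    mean_soft = soft.mean(axis=0)            # average soft assignment per cluster
    hard = soft.argmax(axis=1)               # hard label = most probable cluster
    hard_frac = np.bincount(hard, minlength=n_clusters) / len(hard)
    return mean_soft, hard_frac


def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + eps)).sum())


# Toy data (an assumption for illustration): six points, three clusters,
# every point only marginally prefers cluster 0.
soft = np.tile([0.34, 0.33, 0.33], (6, 1))
mean_soft, hard_frac = assignment_stats(soft)

print("mean soft assignment:", mean_soft)   # close to uniform
print("hard assignment share:", hard_frac)  # all mass on cluster 0
print("entropy of mean soft:", entropy(mean_soft))
print("entropy of hard share:", entropy(hard_frac))
```

In this toy case, a check that only looks at the average soft assignment would see a nearly balanced clustering, while the hard labels place every point in the same cluster. This is one way to see why a regularizer on hard assignments can behave differently from one on averaged soft assignments.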