Clustering has long been a major research topic in machine learning, and deep learning has recently been applied to it with significant success. However, one aspect of clustering that existing deep clustering methods do not address is efficiently producing multiple, diverse partitionings of a given dataset. This is particularly important because a diverse set of base clusterings is necessary for consensus clustering, which has been found to produce better and more robust results than relying on a single clustering. To address this gap, we propose DivClust, a diversity-controlling loss that can be incorporated into existing deep clustering frameworks to produce multiple clusterings with the desired degree of diversity. We conduct experiments with multiple datasets and deep clustering frameworks and show that: a) our method effectively controls diversity across frameworks and datasets at very small additional computational cost, b) the sets of clusterings learned by DivClust include solutions that significantly outperform single-clustering baselines, and c) using an off-the-shelf consensus clustering algorithm, DivClust produces consensus clustering solutions that consistently outperform single-clustering baselines, effectively improving the performance of the base deep clustering framework.