As datasets continue to grow in size and complexity, finding succinct yet accurate data summaries poses a key challenge. Centroid-based clustering, a widely adopted approach to this challenge, summarizes a dataset in terms of a few prototypes, each representing a cluster in the data. Despite the wide adoption of these methods, the resulting data summaries often contain redundancies, which limits their effectiveness, particularly on datasets with a large number of underlying clusters. To overcome this limitation, we introduce the Khatri-Rao clustering paradigm, which extends traditional centroid-based clustering to produce more succinct but equally accurate data summaries by postulating that centroids arise from the interaction of two or more succinct sets of protocentroids. We study two central approaches to centroid-based clustering under the lens of the Khatri-Rao paradigm: the well-established k-Means algorithm and the increasingly popular deep clustering. To this end, we introduce the Khatri-Rao k-Means algorithm and the Khatri-Rao deep clustering framework. Extensive experiments show that Khatri-Rao k-Means can strike a more favorable trade-off between succinctness and accuracy in data summarization than standard k-Means. By leveraging representation learning, the Khatri-Rao deep clustering framework offers even greater benefits, further reducing the size of the data summaries produced by deep clustering while preserving their accuracy.
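To make the idea of centroids arising from interacting protocentroid sets concrete, the following is a minimal sketch and not the authors' implementation. It assumes two sets of protocentroids and an element-wise sum as the interaction operator; the abstract does not specify the operator, so the sum is an illustrative choice. The function names `khatri_rao_centroids` and `assign` and the toy coordinates are hypothetical.

```python
from itertools import product


def khatri_rao_centroids(protos_a, protos_b, combine=lambda x, y: x + y):
    """Build one centroid for every pair (a, b) of protocentroids.

    `combine` is applied element-wise; the default sum is an assumption
    made for illustration, not an operator stated in the abstract.
    """
    return [
        [combine(a_i, b_i) for a_i, b_i in zip(a, b)]
        for a, b in product(protos_a, protos_b)
    ]


def assign(point, centroids):
    """Return the index of the nearest centroid (squared Euclidean distance)."""
    def sq_dist(j):
        return sum((p - c) ** 2 for p, c in zip(point, centroids[j]))
    return min(range(len(centroids)), key=sq_dist)


# Hypothetical toy protocentroids in 2D: 3 coarse offsets and 2 fine offsets.
protos_a = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
protos_b = [[0.0, 0.0], [1.0, 1.0]]

centroids = khatri_rao_centroids(protos_a, protos_b)

stored_vectors = len(protos_a) + len(protos_b)  # vectors kept in the summary
represented = len(centroids)                    # centroids they induce

print(stored_vectors, represented)
print(assign([10.8, 1.1], centroids))
```

In this sketch the summary stores 3 + 2 = 5 vectors yet represents 3 × 2 = 6 centroids. The gap grows quickly with set size: two sets of 10 protocentroids each would induce 100 centroids from 20 stored vectors.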