We find that current objective-based Divisive Hierarchical Clustering (DHC) methods produce dendrograms that lack three desired properties: no unwarranted splitting, grouping of similar clusters into the same subset, and correspondence with the ground truth. This shortcoming has its root cause in the use of a set-oriented bisecting assessment criterion. We show that it can be addressed by using a distributional kernel instead of the set-oriented criterion, and that the resultant clusters achieve a new distribution-oriented objective: maximizing the total similarity of all clusters (TSC). Our theoretical analysis shows that the resultant dendrogram guarantees a lower bound on TSC. An empirical evaluation demonstrates the effectiveness of the proposed method on artificial and Spatial Transcriptomics (bioinformatics) datasets. The proposed method successfully creates a dendrogram that is consistent with the biological regions in a Spatial Transcriptomics dataset, whereas the other contenders fail.