We propose methods for analyzing hierarchical clustering that make full use of the multi-resolution structure of a dendrogram. Specifically, we propose a loss for choosing between clustering methods, a feature importance score, and a graphical tool for visualizing how features segment across a dendrogram. Current approaches to these tasks lose information because they require the user to cut the dendrogram at a specified level and so reduce it to a single partition of the instances. Our methods instead use the full structure of the dendrogram. The key insight is to view a dendrogram as a phylogeny: this analogy allows an evolutionary model to assign a feature value to each internal node of the tree. Experiments on real and simulated datasets provide evidence that the proposed framework produces desirable results and yields more insight than state-of-the-art approaches. We provide an R package that implements our methods.
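To make the phylogeny analogy concrete, the following is a minimal sketch of how an evolutionary model could assign a feature value to every internal node of a dendrogram. It is not the paper's method or the API of its R package. It assumes a Brownian-motion model and uses the standard postorder pass from Felsenstein's independent contrasts, in which each internal node receives the inverse-branch-length weighted average of its children. The `Node` class, the `reconstruct` function, and the toy tree are all hypothetical names introduced for illustration, and merge heights are assumed to play the role of branch lengths.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    """A dendrogram node (hypothetical structure for this sketch)."""
    height: float                      # merge height; 0.0 for leaves
    value: Optional[float] = None      # observed at leaves, estimated elsewhere
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def reconstruct(node: Node) -> Tuple[float, float]:
    """Fill in `value` for all internal nodes below `node`.

    Returns (estimate, extra_length), where extra_length is the amount by
    which the branch above this node is lengthened to account for the
    uncertainty of the estimate (assumed Brownian-motion model).
    """
    if node.left is None and node.right is None:
        return node.value, 0.0         # leaves are observed exactly

    x_l, extra_l = reconstruct(node.left)
    x_r, extra_r = reconstruct(node.right)

    # Branch length = difference in merge heights, plus the child's extra
    # length. The small floor guards against tied heights (an assumption).
    v_l = max(node.height - node.left.height + extra_l, 1e-12)
    v_r = max(node.height - node.right.height + extra_r, 1e-12)

    # Inverse-branch-length weighted average of the two children.
    node.value = (x_l / v_l + x_r / v_r) / (1.0 / v_l + 1.0 / v_r)
    return node.value, v_l * v_r / (v_l + v_r)


# Toy dendrogram: leaves a and b merge at height 1, then join c at height 2.
a = Node(height=0.0, value=1.0)
b = Node(height=0.0, value=3.0)
c = Node(height=0.0, value=10.0)
ab = Node(height=1.0, left=a, right=b)
root = Node(height=2.0, left=ab, right=c)

reconstruct(root)
print(ab.value, root.value)
```

In this toy tree the node joining `a` and `b` sits midway between them, while the root is pulled toward `c` less strongly than a plain average would suggest, because `c` hangs on a longer branch. Note that this single pass uses only the descendants of each node; a full reconstruction would typically add a second, root-to-leaf pass.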