Hierarchical clustering is an effective and interpretable technique for analyzing structure in data, revealing patterns at multiple scales and resolutions. It is particularly helpful when the number of clusters is unknown, and it provides a robust framework for exploring complex datasets. Hierarchical clustering can also uncover structure within clusters, capturing subtle relationships and nested patterns that flat clustering methods may obscure. However, existing hierarchical clustering methods struggle with high-dimensional data, especially when there are no clear density gaps between modes. Our method addresses this limitation with a two-stage approach: it first overclusters the data with a Gaussian or Student's t mixture model, and then hierarchically merges clusters based on the induced density landscape. This approach yields state-of-the-art clustering performance while also providing a meaningful hierarchy, making it a valuable tool for exploratory data analysis. Code is available at https://github.com/ecker-lab/tneb clustering.
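The two-stage idea described above (overcluster with a mixture model, then merge along the density landscape) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fit_overclustered_gmm` and `merge_components` are hypothetical, only the Gaussian case is shown, and the density barrier between two components is approximated by the lowest log-density on the straight line between their means, which is an assumed simplification of whatever path search the actual method uses.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.mixture import GaussianMixture


def fit_overclustered_gmm(X, n_components=6, seed=0):
    """Stage 1: fit deliberately more components than expected clusters."""
    gmm = GaussianMixture(
        n_components=n_components, covariance_type="full", random_state=seed
    )
    return gmm.fit(X)


def merge_components(gmm, n_steps=25):
    """Stage 2 (simplified): build a hierarchy over mixture components.

    The dissimilarity of two components is how far the mixture log-density
    drops, relative to the highest mode, along the straight line between
    their means. Straight-line paths are an assumption of this sketch.
    """
    means = gmm.means_
    k = len(means)
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    peak = gmm.score_samples(means).max()  # log-density at the highest mean
    dists = []
    for i in range(k):
        for j in range(i + 1, k):
            path = (1.0 - t) * means[i] + t * means[j]
            barrier = gmm.score_samples(path).min()  # lowest point on the path
            dists.append(max(peak - barrier, 0.0))
    # The pair order above matches SciPy's condensed distance layout.
    # Single linkage merges the pair separated by the shallowest barrier first.
    return linkage(np.asarray(dists), method="single")


# Toy data: three well-separated 2-D blobs (illustrative only).
rng = np.random.default_rng(0)
centers = ([0.0, 0.0], [6.0, 0.0], [0.0, 6.0])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2)) for c in centers])

gmm = fit_overclustered_gmm(X, n_components=6)
Z = merge_components(gmm)

# Cut the hierarchy at three clusters and map each point through its component.
component_to_cluster = fcluster(Z, t=3, criterion="maxclust")
labels = component_to_cluster[gmm.predict(X)]
```

The linkage matrix `Z` holds the full hierarchy, so it can be cut at any other level with `fcluster` or plotted with `scipy.cluster.hierarchy.dendrogram` without refitting the mixture.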