We introduce an unsupervised learning approach that combines the truncated singular value decomposition with convex clustering to estimate within-cluster directions of maximum variance or covariance (in the variables) while simultaneously clustering the observations hierarchically. In contrast to previous work on joint clustering and embedding, our approach has a straightforward formulation, scales readily via distributed optimization, and admits a direct interpretation as hierarchically clustered principal component analysis (PCA), hierarchically clustered locally linear embedding (LLE), or hierarchically clustered canonical correlation analysis (CCA). Through numerical experiments and real-world examples relevant to precision medicine, we show that our approach outperforms traditional and contemporary clustering methods both on underdetermined problems ($p \gg N$ with tens of observations) and on large datasets (e.g., $N = 100{,}000$), while yielding interpretable dendrograms of hierarchical per-cluster principal components or canonical variates.