Patchwork learning is a new and challenging data collection paradigm in which both samples and features are observed only in fragmented subsets. Because of technological limits, measurement expense, or multimodal data integration, such patchwork data structures are common in fields such as neuroscience, healthcare, and genomics. Instead of analyzing each data patch separately, it is highly desirable to extract comprehensive knowledge from the whole data set. In this work, we focus on the clustering problem in patchwork learning, aiming to discover clusters among all samples even when some pairs of samples are never jointly observed for any feature. We propose a novel spectral clustering method called Cluster Quilting, consisting of (i) patch ordering that exploits the overlapping structure among all patches, (ii) patchwise SVD, (iii) sequential linear mapping of top singular vectors for patch overlaps, followed by (iv) k-means on the combined and weighted singular vectors. Under a sub-Gaussian mixture model, we establish theoretical guarantees via a non-asymptotic misclustering rate bound that reflects both the properties of the patchwise observation regime and the clustering signal and noise dependencies. We also validate our Cluster Quilting algorithm through extensive empirical studies on both simulated and real data sets in neuroscience and genomics, where it discovers more accurate and more scientifically plausible clusters than other approaches.
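The four steps listed in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and several details are assumptions made only for illustration: the function names (`order_patches`, `patch_embedding`, `quilt_embeddings`, `kmeans`, `cluster_quilting`) are hypothetical; patches are ordered by a simple greedy overlap heuristic; consecutive patches are assumed to share at least `r` samples and together cover every sample; "weighted" is read as scaling the left singular vectors by their singular values; the linear map between patches is fitted by least squares on the shared samples; and k-means is plain Lloyd's algorithm with farthest-point initialisation.

```python
import numpy as np

def order_patches(patches):
    """Step (i), assumed heuristic: start from the patch with the most samples,
    then repeatedly add the patch sharing the most samples with those covered."""
    remaining = list(range(len(patches)))
    first = max(remaining, key=lambda m: len(patches[m][0]))
    order, covered = [first], set(patches[first][0])
    remaining.remove(first)
    while remaining:
        nxt = max(remaining, key=lambda m: len(covered & set(patches[m][0])))
        order.append(nxt)
        covered |= set(patches[nxt][0])
        remaining.remove(nxt)
    return order

def patch_embedding(X, rows, cols, r):
    """Step (ii): rank-r SVD of one patch; left singular vectors are scaled
    by their singular values (an assumed reading of 'weighted')."""
    U, s, _ = np.linalg.svd(X[np.ix_(rows, cols)], full_matrices=False)
    return U[:, :r] * s[:r]

def quilt_embeddings(X, patches, r):
    """Step (iii): map each patch's embedding into the basis of the first
    patch, using least squares on samples shared with earlier patches."""
    n = X.shape[0]
    emb = np.full((n, r), np.nan)
    seen = np.zeros(n, dtype=bool)
    for step, m in enumerate(order_patches(patches)):
        rows, cols = np.asarray(patches[m][0]), np.asarray(patches[m][1])
        A = patch_embedding(X, rows, cols, r)
        if step > 0:
            shared = seen[rows]  # mask of this patch's samples already embedded
            if shared.sum() < r:
                raise ValueError("patch overlap has fewer than r samples")
            T, *_ = np.linalg.lstsq(A[shared], emb[rows[shared]], rcond=None)
            A = A @ T
        new = ~seen[rows]
        emb[rows[new]] = A[new]  # keep the earliest estimate for shared samples
        seen[rows] = True
    return emb

def kmeans(Z, k, n_iter=50):
    """Step (iv): plain Lloyd's algorithm with farthest-point initialisation."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None]) ** 2).sum(axis=2)
        labels = np.argmin(dist, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

def cluster_quilting(X, patches, r, k):
    return kmeans(quilt_embeddings(X, patches, r), k)

# Toy data: two clusters, two patches that overlap on samples 10..19.
# Samples 0..9 and 20..29 are never observed on a common feature.
rng = np.random.default_rng(0)
n, p = 30, 20
truth = np.arange(n) % 2
centers = np.stack([2.0 * np.ones(p), 2.0 * (-1.0) ** np.arange(p)])
full = centers[truth] + 0.05 * rng.standard_normal((n, p))
patches = [(np.arange(0, 20), np.arange(0, 12)),
           (np.arange(10, 30), np.arange(8, 20))]
X = np.full((n, p), np.nan)  # unobserved entries stay NaN and are never read
for rows, cols in patches:
    X[np.ix_(rows, cols)] = full[np.ix_(rows, cols)]
labels = cluster_quilting(X, patches, r=2, k=2)
```

In this toy setting the alignment works because each patch is approximately rank `r` and the shared samples span all `r` directions, so the least-squares map between the two patch bases is well determined. When overlaps are small or noisy, that map becomes unstable, which is presumably why the ordering of patches and the overlap structure matter in the method described above.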