As ranking data become increasingly available, demand has grown for unsupervised rank-based inferential frameworks that can handle high-dimensional datasets and provide uncertainty quantification for all estimates. Rank-based methods have also become increasingly popular in -omics pipelines, because ranking continuous measurements is a robust way to handle non-normally distributed data. The Bayesian Mallows model (BMM) has emerged as a promising choice because it adapts to various types of ranking data and offers a flexible framework that integrates cluster-wise rank aggregation with individual-level inference. However, the scalability of BMM to ultra-high-dimensional settings, such as -omics analyses, has remained limited. This paper addresses the issue by introducing the lower-dimensional Bayesian Mallows Model Mixture (lowBM3), the first rank-based model that generalizes BMM to handle clustering and variable selection jointly. The proposed method provides a novel Bayesian framework that simultaneously and scalably handles sample heterogeneity, unsupervised parameter estimation, and model selection for ultra-high-dimensional data. A companion postprocessing framework is also introduced to provide summaries of the discrete posterior distributions of both the consensus ranking and the variable selector. Simulation studies assess the performance of the method. Its usefulness is further demonstrated in an application to signature discovery in cancer genomics, in which bulk RNA-seq gene expression data from breast cancer patients are clustered genome-wide.
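To illustrate why ranking continuous measurements is robust to non-normality, the following minimal Python sketch converts one sample's measurements into ranks and shows that a monotone distortion of the data leaves the ranks unchanged. It is not part of lowBM3 or of any BMM software: the function name `rank_transform`, the expression values, and the convention that rank 1 denotes the largest value are assumptions made only for illustration.

```python
import math


def rank_transform(values):
    """Return 1-based ranks for `values`.

    Assumed convention (not taken from the paper): rank 1 is the largest
    value. Ties are broken by original position because `sorted` is stable.
    """
    # Indices of the items, ordered from largest to smallest value.
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


# Hypothetical expression values for five genes in a single sample.
sample = [2.5, 1200.0, 0.3, 15.0, 7.1]

# The same sample after a monotone transformation (natural log).
distorted = [math.log(v) for v in sample]

print(rank_transform(sample))     # [4, 1, 5, 2, 3]
print(rank_transform(distorted))  # [4, 1, 5, 2, 3]
```

Because the ranks depend only on the ordering of the measurements, skewness, heavy tails, and outliers such as the value 1200.0 above do not change the input that a rank-based model receives.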