Bayesian genome-wide clustering and variable selection of transcriptomic data via rank-based mixtures

With the increasing availability of ranking data, there has been a growing demand for appropriate unsupervised rank-based inferential frameworks capable of handling high-dimensional datasets and providing uncertainty quantification for all estimates. Rank-based methods have also grown in popularity in -omics pipelines, as ranking continuous measurements provides a robust means of handling non-normally distributed data. The Bayesian Mallows model (BMM) has emerged as a promising choice because of its adaptability to various types of ranking data and its flexible framework, which integrates cluster-wise rank aggregation with inference at the individual level. However, the scalability of BMM to ultra-high-dimensional settings, such as -omics analyses, has remained limited. The present paper addresses this issue by introducing the first rank-based model generalizing BMM to jointly handle clustering and variable selection, namely the lower-dimensional Bayesian Mallows Model Mixture (lowBM3). The proposed method provides a novel Bayesian framework that simultaneously handles heterogeneity in the sample, unsupervised parameter estimation, and model selection in a scalable manner for ultra-high-dimensional data. Additionally, a companion postprocessing framework is introduced to provide posterior summaries of the discrete posterior distributions of both the consensus ranking and the variable selector. Simulation studies are performed to assess the performance of the method. The usefulness of the method is also shown in an application to signature discovery for cancer genomics, where bulk RNA-seq gene expression data obtained from breast cancer patients are clustered genome-wide.
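As a rough illustration of the two ingredients the abstract mentions, ranking continuous measurements and scoring a ranking against a consensus under a Mallows-type model, the following minimal sketch may help. It is not the lowBM3 method: the function names are hypothetical, the footrule distance is an assumed choice of distance, and only the unnormalized log-kernel -(alpha / n) * d(r, rho) is computed, with no clustering, variable selection, or posterior sampling.

```python
def rank_transform(values):
    """Convert continuous measurements to ranks 1..n (1 = largest value).

    Ties are broken by original order, because sorted() is stable.
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def footrule_distance(r, rho):
    """Sum of absolute rank differences between two rankings."""
    return sum(abs(a - b) for a, b in zip(r, rho))


def mallows_log_kernel(r, rho, alpha):
    """Unnormalized Mallows log-probability of ranking r.

    rho is the consensus ranking and alpha >= 0 is the scale parameter.
    The normalizing constant is omitted (assumption: it is not needed
    when comparing rankings under the same alpha).
    """
    n = len(r)
    return -(alpha / n) * footrule_distance(r, rho)


# Hypothetical expression values for four genes in one sample.
expression = [5.2, 0.3, 9.1, 2.7]
ranks = rank_transform(expression)
consensus = [1, 2, 3, 4]
score = mallows_log_kernel(ranks, consensus, alpha=2.0)
```

Rankings closer to the consensus receive a higher (less negative) score, and a larger alpha concentrates the distribution more tightly around the consensus.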

