Clustering is commonly performed as an initial analysis step for uncovering structure in 'omics datasets, e.g. to discover molecular subtypes of disease. The high-throughput, high-dimensional nature of these datasets means that they provide information on a diverse array of biomolecular processes and pathways. Different groups of variables (e.g. genes or proteins) will be implicated in different biomolecular processes, so analyses that are limited to identifying a single clustering partition of the whole dataset are liable to conflate the multiple clustering structures that may arise from these distinct processes. To address this, we propose a multi-view Bayesian mixture model that identifies groups of variables (``views''), each of which defines a distinct clustering structure. We consider applications in stratified medicine, for which our principal goal is to identify clusters of patients that define distinct, clinically actionable disease subtypes. We adopt the semi-supervised, outcome-guided mixture modelling approach of Bayesian profile regression, which makes use of a response variable to guide inference toward the clusterings that are most relevant in a stratified medicine context. We present the model together with illustrative simulation examples and examples from pan-cancer proteomics. We demonstrate how the approach can be used to perform integrative clustering, and consider an example in which different 'omics datasets are integrated in the context of breast cancer subtyping.