Owing to the high heterogeneity and clinical characteristics of cancer, multi-omics data and clinical features differ significantly among the subtypes of different cancers. Identifying and discovering cancer subtypes is therefore crucial for cancer diagnosis, treatment, and prognosis. In this study, we propose a general framework based on attention mechanisms for unsupervised contrastive learning, which analyzes cancer multi-omics data to identify and characterize cancer subtypes. The framework contains a symmetric unsupervised multi-head attention encoder that extracts contextual features and long-range dependencies from multi-omics data, reducing the impact of noise. Importantly, the framework includes a decoupled contrastive learning model (DEDUCE), based on the multi-head attention mechanism, that learns multi-omics features and clusters samples to identify cancer subtypes. The method clusters subtypes by calculating the similarity between samples in both the feature space and the sample space of the multi-omics data. The basic idea is to decouple different attributes of the multi-omics features and learn them as contrasting terms: a contrastive loss function that measures the difference between positive and negative examples is constructed and minimized, encouraging the model to learn better feature representations. We evaluated DEDUCE in large-scale experiments on simulated multi-omics datasets, single-cell multi-omics datasets, and cancer multi-omics datasets, where it outperformed 10 deep learning models. Finally, we used DEDUCE to reveal six cancer subtypes of acute myeloid leukemia (AML). By analyzing GO functional enrichment, subtype-specific biological functions and GSEA of AML,
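The contrastive objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formula, so the code below makes an assumption: "decoupled" is taken to mean that the positive pair is left out of the denominator of an InfoNCE-style loss, so each sample is pulled toward its positive and pushed away from the negatives only. The function names, the cosine similarity, and the temperature value are illustrative choices, not details taken from DEDUCE.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def decoupled_contrastive_loss(view_a, view_b, temperature=0.5):
    """Mean decoupled contrastive loss over paired embeddings.

    view_a[i] and view_b[i] are two embeddings of the same sample (a
    positive pair). Every other embedding in either view is a negative.
    Assumption: the positive term is excluded from the denominator.
    Requires at least two samples so that negatives exist.
    """
    n = len(view_a)
    embeddings = list(view_a) + list(view_b)  # 2n embeddings in total
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the positive partner of i
        pos_sim = cosine(embeddings[i], embeddings[pos]) / temperature
        neg_sum = sum(
            math.exp(cosine(embeddings[i], embeddings[k]) / temperature)
            for k in range(2 * n)
            if k != i and k != pos
        )
        # Pull the positive pair together, push negatives apart.
        total += -pos_sim + math.log(neg_sum)
    return total / (2 * n)
```

When the two views of each sample agree, the loss is lower than when the pairing is scrambled, which is the behavior the abstract relies on to learn cluster-friendly representations. In a full model the embeddings would come from the attention encoder; here they are plain lists of floats.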