Recent advances in image classification have shown that contrastive learning (CL) can aid subsequent learning tasks by learning good feature representations from a limited number of data samples. In this paper, we applied CL to tumor transcriptomes and clinical data to learn feature representations in a low-dimensional space. We then used the learned features to train a classifier that assigns each tumor to a high-risk or low-risk recurrence group. Using data from The Cancer Genome Atlas (TCGA), we demonstrated that CL can significantly improve classification accuracy. Specifically, our CL-based classifiers achieved an area under the receiver operating characteristic curve (AUC) greater than 0.8 for 14 types of cancer, and an AUC greater than 0.9 for 2 types of cancer. We also developed CL-based Cox (CLCox) models for predicting cancer prognosis. Our CLCox models trained with the TCGA data significantly outperformed existing methods in predicting the prognosis of 19 types of cancer under consideration. The performance of the CLCox models and CL-based classifiers trained with TCGA lung and prostate cancer data was validated using data from two independent cohorts. We also showed that the CLCox model trained with the whole transcriptome significantly outperforms the Cox model trained with the 21 genes of Oncotype DX, which is in clinical use for breast cancer patients. CL-based classifiers and CLCox models for 19 types of cancer are publicly available and can be used to predict cancer prognosis from the RNA-seq transcriptome of an individual tumor. Python code for model training and testing is also publicly accessible and can be used to train new CL-based models with tumor gene expression data.
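The two objectives behind such a pipeline can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the released code: the abstract does not state which contrastive objective is used, so the supervised contrastive loss below is an assumption, and the function names `supervised_contrastive_loss` and `cox_neg_log_partial_likelihood` are hypothetical. The Cox loss is the standard negative log partial likelihood, written here under the simplifying assumption that there are no tied event times.

```python
import numpy as np


def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss (an assumed objective, not confirmed by the abstract).

    z: (n, d) embeddings, labels: (n,) class labels such as high/low risk.
    Samples sharing a label are pulled together, all others pushed apart.
    """
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalise each embedding
    sim = z @ z.T / temperature                       # scaled cosine similarities
    not_self = ~np.eye(len(labels), dtype=bool)
    sim = np.where(not_self, sim, -np.inf)            # exclude self-comparisons
    # Row-wise log-softmax over all other samples (numerically stable).
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_norm
    # Positives: same label, different sample.
    pos = (labels[:, None] == labels[None, :]) & not_self
    n_pos = pos.sum(axis=1)
    has_pos = n_pos > 0                               # skip anchors without positives
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    return -(pos_log_prob[has_pos] / n_pos[has_pos]).mean()


def cox_neg_log_partial_likelihood(risk, time, event):
    """Negative log partial likelihood of a Cox model, averaged over events.

    risk: (n,) predicted log-risk scores, time: (n,) follow-up times,
    event: (n,) 1 if the event was observed, 0 if censored.
    Assumes no tied event times (no Breslow or Efron correction).
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    order = np.argsort(-time)                         # longest follow-up first
    risk, event = risk[order], event[order]
    # Running log-sum-exp gives the log of the risk-set sum for each subject.
    log_risk_set = np.logaddexp.accumulate(risk)
    n_events = max(int(event.sum()), 1)
    return -(risk[event] - log_risk_set[event]).sum() / n_events
```

In a two-stage setup of this kind, an encoder would first be trained to minimise the contrastive loss on transcriptome features, and a classifier or Cox head would then be fitted on the resulting low-dimensional embeddings. The encoder architecture and training schedule are left out because the abstract does not describe them.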