Contrastive self-supervised learning (SSL) is used to train a model that predicts cancer type from miRNA, mRNA or RPPA expression data. This model, a pretrained FT-Transformer, is shown to outperform XGBoost and CatBoost, standard benchmarks for tabular data, when labelled samples are scarce but unlabelled samples are plentiful. This holds even though the datasets we use have $\mathcal{O}(10^{1})$ classes and between $\mathcal{O}(10^{2})$ and $\mathcal{O}(10^{4})$ features. After demonstrating the efficacy of our chosen method of self-supervised pretraining, we investigate SSL for multi-modal models. We propose a late-fusion model in which each omics is passed through its own sub-network; the sub-network outputs are averaged and passed to the pretraining or downstream objective function. Multi-modal pretraining is shown to improve predictions from a single omics, and we argue that this is useful for datasets with many unlabelled multi-modal samples but few labelled unimodal samples. Additionally, we show that pretraining each omics-specific module individually is highly effective. This enables the proposed model to be applied in a variety of contexts where a large amount of unlabelled data is available from each omics but only a few labelled samples are.
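The late-fusion step described above can be illustrated with a minimal sketch. It is not the implementation used in this work: each omics-specific sub-network is replaced by a random linear map (standing in for an FT-Transformer), the feature counts and embedding size are toy values, and the names `make_linear`, `apply_linear` and `late_fusion` are hypothetical. Averaging over only the omics present in a sample is also an assumption, made here so that the same fused model can be evaluated on a single omics.

```python
import random


def make_linear(n_in, n_out, rng):
    """Random weight matrix (n_out rows of n_in weights).

    Assumption: a stand-in for an omics-specific sub-network.
    """
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, x):
    """Map a feature vector x to an embedding with one linear layer."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def late_fusion(sample, subnets):
    """Embed each available omics with its own sub-network, then average.

    `sample` maps an omics name to its feature vector; omics that are
    missing from the sample are simply left out of the average.
    """
    embeddings = [apply_linear(subnets[name], x) for name, x in sample.items()]
    if not embeddings:
        raise ValueError("sample must contain at least one omics")
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


rng = random.Random(0)
EMBED_DIM = 4  # toy embedding size shared by all sub-networks
n_features = {"mirna": 6, "mrna": 10, "rppa": 5}  # toy feature counts

# One sub-network per omics, all projecting to the same embedding size.
subnets = {name: make_linear(n, EMBED_DIM, rng) for name, n in n_features.items()}

# A multi-modal sample, and the same sample restricted to mRNA only.
multi = {name: [rng.gauss(0.0, 1.0) for _ in range(n)] for name, n in n_features.items()}
fused = late_fusion(multi, subnets)
single = late_fusion({"mrna": multi["mrna"]}, subnets)
```

In this sketch the fused embedding has the same dimension whether one or several omics are supplied, so the same pretraining or downstream objective can consume either.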