Multiomics data fusion integrates diverse data modalities, from transcriptomics to proteomics, to build a comprehensive understanding of biological systems and to improve prediction of outcomes related to disease phenotypes and treatment responses. Cooperative learning, a recently proposed method, unifies commonly used fusion approaches, including early and late fusion, and offers a systematic framework for leveraging the shared underlying relationships across omics to strengthen signals. However, acquiring large-scale labeled data remains a challenge, and in many cases multiomics data are available without annotated labels. To harness the potential of unlabeled multiomics data, we introduce semi-supervised cooperative learning. Through an "agreement penalty", our method incorporates the additional unlabeled data into the learning process and achieves consistently superior predictive performance on simulated data and in a real multiomics study of aging. It offers an effective solution to multiomics data fusion in settings with both labeled and unlabeled data and maximizes the utility of available data resources, with the potential to significantly improve predictive models for diagnostics and therapeutics in an increasingly multiomics world.
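To make the role of the agreement penalty concrete, the following is a minimal sketch for two omics views with linear predictors. It rests on assumptions that the abstract does not spell out: the cooperative learning objective is taken in its least-squares form, 1/2 ||y - X θx - Z θz||² + ρ/2 ||X θx - Z θz||², and the semi-supervised extension is assumed to apply the same agreement term to the unlabeled samples as well. The function name `fit_semisup_cooperative`, its arguments, and the omission of any sparsity penalty are illustrative choices, not the implementation used in the study.

```python
import numpy as np


def fit_semisup_cooperative(X, Z, y, X_unlab, Z_unlab, rho=1.0):
    """Illustrative two-view linear fit with an agreement penalty.

    Minimizes (an assumed form of the objective):
        1/2 * ||y - X tx - Z tz||^2
      + rho/2 * ||X_all tx - Z_all tz||^2
    where X_all / Z_all stack the labeled and unlabeled rows.
    """
    s = np.sqrt(rho)

    # The agreement term is evaluated on every sample, labeled or not.
    X_all = np.vstack([X, X_unlab])
    Z_all = np.vstack([Z, Z_unlab])

    # Rewrite the penalized problem as one ordinary least-squares problem:
    # the top rows fit the labels, the bottom rows push the two views'
    # predictions toward each other (target 0 for their difference).
    A = np.vstack([
        np.hstack([X, Z]),
        np.hstack([-s * X_all, s * Z_all]),
    ])
    b = np.concatenate([y, np.zeros(X_all.shape[0])])

    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    p_x = X.shape[1]
    return theta[:p_x], theta[p_x:]
```

In this sketch, `rho=0` reduces to ordinary least squares on the concatenated features (early fusion), and larger values of `rho` pull the two views' predictions together on all available samples. In practice `rho` would be chosen by cross-validation on the labeled data.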