Integrating heterogeneous, high-dimensional multi-omics data is becoming increasingly important for understanding genetic data. Each omics technique provides only a limited view of the underlying biological process, and integrating heterogeneous omics layers simultaneously can lead to a more comprehensive and detailed understanding of diseases and phenotypes. However, one obstacle to multi-omics data integration is the existence of unpaired multi-omics data, which arise from limits on instrument sensitivity and cost. Studies may fail if data on certain aspects of the subjects are missing or incomplete. In this paper, we propose a deep learning method for multi-omics integration with incomplete data: Cross-omics Linked unified embedding with Contrastive Learning and Self Attention (CLCLSA). Using complete multi-omics data as supervision, the model employs cross-omics autoencoders to learn feature representations across different types of biological data. Multi-omics contrastive learning, which maximizes the mutual information between different omics types, is applied before latent feature concatenation. In addition, feature-level and omics-level self-attention are employed to dynamically identify the most informative features for multi-omics data integration. Extensive experiments were conducted on four public multi-omics datasets. The results indicate that CLCLSA outperformed state-of-the-art approaches for multi-omics data classification with incomplete multi-omics data.
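To make the contrastive step more concrete, the following is a minimal sketch of an InfoNCE-style loss between the latent embeddings of two omics types. InfoNCE is a common lower-bound surrogate for mutual information; the abstract does not state which contrastive objective CLCLSA uses, so this choice is an assumption. The function names, the temperature value, and the toy "mRNA" and "methylation" embeddings are all hypothetical and purely illustrative.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(z_a, z_b, temperature=0.5):
    """InfoNCE-style loss between two omics views of the same samples.

    z_a[i] and z_b[i] are latent embeddings of sample i from two omics
    types. Row i of z_b is the positive for row i of z_a; every other
    row of z_b acts as a negative. Illustrative only, not the paper's
    exact objective.
    """
    z_a = [l2_normalize(v) for v in z_a]
    z_b = [l2_normalize(v) for v in z_b]
    n = len(z_a)
    total = 0.0
    for i in range(n):
        # Cosine similarity of sample i (view A) to every sample in view B.
        logits = [
            sum(p * q for p, q in zip(z_a[i], z_b[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching sample as the correct class.
        total += log_denom - logits[i]
    return total / n


# Hypothetical toy embeddings for three samples in two omics views.
mrna = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
meth_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
# Same embeddings, but paired with the wrong samples.
meth_shuffled = [meth_aligned[1], meth_aligned[2], meth_aligned[0]]

loss_aligned = info_nce(mrna, meth_aligned)
loss_shuffled = info_nce(mrna, meth_shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is lower when the two views of the same sample agree, so minimizing it pulls matched cross-omics embeddings together before they are concatenated.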