Framing the study of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. Successful machine learning models depend on high-quality training datasets with sufficient data volume and adequate preprocessing. However, although several public data portals exist, including The Cancer Genome Atlas (TCGA) multi-omics initiative and open databases such as LinkedOmics, these databases cannot be used off the shelf by existing machine learning models. In this paper, we propose MLOmics, an open cancer multi-omics benchmark that aims to better serve the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types, with four omics types, stratified features, and extensive baselines. Complementary resources for downstream analysis and bio-knowledge linking are also included to support interdisciplinary analysis.