As key elements of the central dogma, DNA, RNA, and proteins play crucial roles in sustaining life by ensuring that genetic information is accurately expressed and carried out. Research on these molecules has profoundly impacted fields such as medicine, agriculture, and industry. However, the diversity of machine learning approaches, from traditional statistical methods to deep learning models and large language models, makes it difficult for researchers to choose the most suitable model for a specific task. The difficulty is greatest for cross-omics and multi-omics tasks, where comprehensive benchmarks are lacking. To address this, we introduce COMET (Benchmark for Biological COmprehensive Multi-omics Evaluation Tasks and Language Models), the first comprehensive multi-omics benchmark, designed to evaluate models across single-omics, cross-omics, and multi-omics tasks. First, we curate and develop a diverse collection of downstream tasks and datasets covering key structural and functional aspects of DNA, RNA, and proteins, including tasks that span multiple omics levels. Then, we evaluate existing foundational language models for DNA, RNA, and proteins, as well as the newly proposed multi-omics method, offering insights into how well they integrate and analyze data from different biological modalities. This benchmark aims to define critical issues in multi-omics research and guide future directions, ultimately advancing the understanding of biological processes through the analysis of both integrated and individual omics data.