COMET: Benchmark for Comprehensive Biological Multi-omics Evaluation Tasks and Language Models

Yuchen Ren, Wenwei Han, Qianyuan Zhang, Yining Tang, Weiqiang Bai, Yuchen Cai, Lifeng Qiao, Hao Jiang, Dong Yuan, Tao Chen, Siqi Sun, Pan Tan, Wanli Ouyang, Nanqing Dong, Xinzhu Ma, Peng Ye

As key elements of the central dogma, DNA, RNA, and proteins play crucial roles in maintaining life by ensuring the accurate expression and implementation of genetic information. Although research on these molecules has profoundly impacted fields such as medicine, agriculture, and industry, the diversity of machine learning approaches, from traditional statistical methods to deep learning models and large language models, makes it difficult for researchers to choose the most suitable model for a specific task. The difficulty is greatest for cross-omics and multi-omics tasks, where comprehensive benchmarks are lacking. To address this, we introduce the first comprehensive multi-omics benchmark, COMET (Benchmark for Biological COmprehensive Multi-omics Evaluation Tasks and Language Models), designed to evaluate models across single-omics, cross-omics, and multi-omics tasks. First, we curate and develop a diverse collection of downstream tasks and datasets covering key structural and functional aspects of DNA, RNA, and proteins, including tasks that span multiple omics levels. Then, we evaluate existing foundation language models for DNA, RNA, and proteins, as well as the newly proposed multi-omics method, offering insights into how well they integrate and analyze data from different biological modalities. This benchmark aims to define critical issues in multi-omics research and guide future directions, ultimately advancing the understanding of biological processes through integrated analysis of data from different omics.

