RNA plays a pivotal role in translating genetic instructions into functional outcomes, which underscores its importance in biological processes and disease mechanisms. Despite the emergence of numerous deep learning approaches for RNA, particularly universal RNA language models, standardized benchmarks for assessing the effectiveness of these methods are still lacking. In this study, we introduce BEACON (\textbf{BE}nchm\textbf{A}rk for \textbf{CO}mprehensive R\textbf{N}A Task and Language Models), the first comprehensive RNA benchmark. First, BEACON comprises 13 distinct tasks derived from extensive previous work, covering structural analysis, functional studies, and engineering applications, which enables a comprehensive assessment of methods across diverse RNA understanding tasks. Second, we examine a range of models, from traditional approaches such as CNNs to advanced RNA foundation models based on language models, offering insights into their task-specific performance. Third, we investigate two key components of RNA language models: the tokenizer and the positional encoding. Notably, our findings highlight the superiority of single-nucleotide tokenization and the effectiveness of Attention with Linear Biases (ALiBi) over traditional positional encoding methods. Based on these insights, we propose a simple yet strong baseline, BEACON-B, which achieves outstanding performance with limited data and computational resources. The datasets and source code of our benchmark are available at https://github.com/terry-r123/RNABenchmark.
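To make the two design choices named in the abstract concrete, here is a minimal Python sketch of single-nucleotide tokenization and an ALiBi-style attention bias. It is an illustration under stated assumptions, not the BEACON-B implementation: the vocabulary `VOCAB`, the function names, and the treatment of T as U are hypothetical, and the bias uses the symmetric distance |i - j|, a common adaptation for bidirectional encoders, rather than the causal form in the original ALiBi formulation. The slopes follow the geometric sequence 2^(-8h/n) for n heads, assuming n is a power of two.

```python
# Hypothetical vocabulary for illustration only; not taken from the BEACON code.
VOCAB = {"<pad>": 0, "<unk>": 1, "A": 2, "C": 3, "G": 4, "U": 5}


def tokenize_single_nucleotide(seq):
    """Map every nucleotide to exactly one token id (assumption: T is read as U)."""
    seq = seq.upper().replace("T", "U")
    return [VOCAB.get(base, VOCAB["<unk>"]) for base in seq]


def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8h/n) for h = 1..n; assumes n is a power of two."""
    if num_heads <= 0 or num_heads & (num_heads - 1) != 0:
        raise ValueError("this sketch only handles a power-of-two number of heads")
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def alibi_bias(seq_len, num_heads):
    """Return bias[h][i][j] = -slope_h * |i - j| (symmetric, encoder-style variant).

    The bias is added to the attention logits before the softmax, so more
    distant positions are penalized linearly and no positional embedding
    needs to be learned.
    """
    return [
        [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for slope in alibi_slopes(num_heads)
    ]


ids = tokenize_single_nucleotide("AUGGCT")
bias = alibi_bias(len(ids), num_heads=4)
print(ids)                        # one token id per nucleotide
print(len(bias), len(bias[0]))    # heads x sequence length
```

Because each nucleotide is its own token, the sequence length seen by the model equals the RNA length, and the bias matrix depends only on pairwise distances, so it can be built for any length at inference time.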