RNA structure plays a vital role in many cellular processes, yet existing genomic foundation models (FMs) struggle with precise sequence-structure alignment because the space of possible nucleotide base combinations is exponentially large. In this study, we introduce OmniGenome, an RNA foundation model that addresses this challenge. OmniGenome bridges sequences and secondary structures through structure-contextualized modeling, enabling hard in-silico genomic tasks that existing FMs cannot handle, such as RNA design. Results on two comprehensive genomic benchmarks show that OmniGenome achieves state-of-the-art performance on complex RNA subtasks. For example, OmniGenome solved 74% of complex puzzles, whereas SpliceBERT solved only 3%. Moreover, OmniGenome solves most of the puzzles within $1$ hour, while existing methods usually allocate $24$ hours to each puzzle. Overall, OmniGenome supports a wide range of genomic applications and offers insight into biological mechanisms from the perspective of sequence-structure alignment.