Inspired by the success of large language models (LLMs) for DNA and proteins, several LLMs for RNA have been developed recently. RNA-LLMs use large datasets of RNA sequences to learn, in a self-supervised way, how to represent each RNA base with a semantically rich numerical vector, under the hypothesis that high-quality RNA representations can enhance data-costly downstream tasks. Among these tasks, secondary structure prediction is fundamental for uncovering RNA functional mechanisms. In this work we present a comprehensive experimental analysis of several pre-trained RNA-LLMs, comparing them on the RNA secondary structure prediction task within a unified deep learning framework. The RNA-LLMs were assessed on benchmark datasets of increasing generalization difficulty. Results show that two LLMs clearly outperform the other models, and reveal significant challenges for generalization in low-homology scenarios.
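To make the setup concrete, below is a minimal sketch, not the paper's actual framework, of how frozen per-base embeddings from a pre-trained RNA-LLM could feed a shared prediction head that outputs a base-pairing probability matrix. The embedding dimension, the head architecture, and the random stand-in embeddings are all hypothetical placeholders; the only assumption taken from the text is that each RNA-LLM supplies one vector per base to the same downstream predictor.

```python
# Minimal sketch (assumptions noted above): per-base embeddings from a
# frozen RNA-LLM -> symmetric L x L base-pairing probability matrix.
import torch
import torch.nn as nn

class PairPredictionHead(nn.Module):
    """Maps per-base embeddings (L x d) to an L x L base-pairing score matrix."""
    def __init__(self, embed_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(embed_dim, hidden_dim)
        self.scorer = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (L, embed_dim) -- one vector per RNA base from the LLM
        h = self.proj(emb)                              # (L, hidden)
        L = h.size(0)
        # All pairwise concatenations (i, j) -> (L, L, 2 * hidden)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(L, L, -1),
             h.unsqueeze(0).expand(L, L, -1)], dim=-1)
        logits = self.scorer(pairs).squeeze(-1)         # (L, L)
        # Symmetrize, since pairing(i, j) == pairing(j, i)
        return torch.sigmoid(0.5 * (logits + logits.T))

# Usage with a stand-in for a frozen pre-trained RNA-LLM:
embed_dim = 640                          # hypothetical embedding size
seq = "GGGAAACCC"                        # toy hairpin-like sequence
emb = torch.randn(len(seq), embed_dim)   # would come from the frozen LLM
head = PairPredictionHead(embed_dim)
pair_probs = head(emb)                   # (9, 9) base-pairing probabilities
```

Keeping the head identical across models and varying only the embedding source is what lets such a framework attribute performance differences to the RNA-LLM representations themselves.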