Predicting the consensus structure of a set of aligned RNA homologs is a convenient way to find conserved structures in an RNA genome, with applications that include viral diagnostics and therapeutics. However, the most commonly used tool for this task, RNAalifold, is prohibitively slow for long sequences because its runtime scales cubically with sequence length: it takes over a day on 400 SARS-CoV-2 and SARS-related genomes (~30,000 nt). We present LinearAlifold, a much faster alternative that scales linearly with both the sequence length and the number of sequences. It builds on our earlier work, LinearFold, which folds a single RNA in linear time. LinearAlifold is orders of magnitude faster than RNAalifold (0.7 hours on the above 400 genomes, a ~36$\times$ speedup) and achieves higher accuracy when evaluated against a database of known structures. More interestingly, LinearAlifold's prediction on SARS-CoV-2 correlates well with experimentally determined structures, substantially outperforming RNAalifold. Finally, LinearAlifold supports two energy models (Vienna and BL*) and four modes: minimum free energy (MFE), maximum expected accuracy (MEA), ThreshKnot, and stochastic sampling, each of which takes under an hour for hundreds of SARS-CoV variants. Code is available at https://github.com/LinearFold/LinearAlifold and a web server at http://linearfold.org/linear-alifold.
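As a rough illustration of the figures quoted above, the short Python sketch below works through two pieces of arithmetic: the RNAalifold runtime implied by 0.7 hours and a ~36$\times$ speedup, and how an idealized cubic-time method compares with a linear-time one as the sequence length grows. The function names are hypothetical, constant factors and the number of sequences are ignored, and nothing here reproduces the authors' benchmark or either tool's implementation.

```python
def implied_baseline_hours(fast_hours: float, speedup: float) -> float:
    """Runtime of the slower tool implied by a reported speedup factor."""
    return fast_hours * speedup


def relative_cost(length_ratio: float, exponent: int) -> float:
    """Factor by which runtime grows when the sequence length grows by
    length_ratio, assuming runtime is proportional to length ** exponent
    (an idealized model that ignores constant factors)."""
    return length_ratio ** exponent


# Figures quoted in the abstract: 0.7 hours and a ~36x speedup.
baseline = implied_baseline_hours(0.7, 36)  # about 25.2 hours, i.e. over a day

# Idealized growth for a tenfold longer genome (e.g. 3,000 nt -> 30,000 nt).
cubic_growth = relative_cost(10, 3)   # 1000x for a cubic-time method
linear_growth = relative_cost(10, 1)  # 10x for a linear-time method

print(f"implied baseline: {baseline:.1f} h")
print(f"cubic growth: {cubic_growth}x, linear growth: {linear_growth}x")
```

Under this simplified model, a tenfold increase in length costs a cubic-time method about 100 times more than a linear-time one, which is why the gap widens on genome-scale inputs.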