The emergence of next-generation sequencing has drastically increased the volume of transcriptomic data. Although many standalone algorithms and workflows for novel microRNA (miRNA) prediction have been proposed, few are designed to process large volumes of sequence data from large genomes, and even fewer go on to annotate functional miRNAs by analyzing multiple libraries. We propose an improved pipeline for high-volume data facilities, implemented as mirLibSpark on the Apache Spark framework. This pipeline is currently the fastest method and improves accuracy relative to the standard. In this paper, we deliver the first distributed functional miRNA predictor as a standalone, fully automated package. It is an efficient and accurate miRNA predictor that also provides functional insight. Furthermore, it complies with the gold-standard requirements for plant miRNA prediction.