Many machine learning (ML) models used for protein function and structure prediction depend on evolutionary information, captured as multiple-sequence alignments (MSAs) or position-specific scoring matrices (PSSMs) generated by PSI-BLAST. As a result, these predictive methods carry substantial computational demands and long runtimes. The main bottleneck is that PSI-BLAST must load large sequence databases sequentially in batches and then search them for sequences that align to a given query. For batch queries, the runtime scales linearly with the number of query sequences. The problem is becoming more severe as bio-sequence data repositories grow exponentially over time, and this growth places a proportional strain on the runtime of PSI-BLAST. One prominent solution is MMseqs2, which can speed up the search by a factor of about 100. However, MMseqs2 cannot be used directly to produce the final output in the desired format of PSI-BLAST alignments and PSSM profiles. In this work, I developed a pipeline that integrates MMseqs2 and PSI-BLAST into a robust, optimized, and efficient hybrid alignment pipeline. The hybrid tool generates sequence alignment profiles about two orders of magnitude faster than PSI-BLAST alone. It is implemented in C++ and is freely available under the MIT license at https://github.com/issararab/EPSAPG.