$\textit{De novo}$ genome assembly is one of the most important tasks in computational biology. ELBA is the state-of-the-art distributed-memory parallel algorithm for the overlap detection and layout simplification steps of $\textit{de novo}$ genome assembly, but its pairwise alignment step is a performance bottleneck. In this work, we propose three GPU schedulers for ELBA that accommodate multiple MPI processes and multiple GPUs. The schedulers enable multiple MPI processes to perform computation on the GPUs in a round-robin fashion. Both strong and weak scaling experiments show that all three schedulers significantly improve performance over the baseline, although there is a trade-off between parallelism and GPU scheduler overhead. The best-performing implementation, the one-to-one scheduler, achieves a $\sim$7-8$\times$ speed-up over the baseline vanilla ELBA GPU scheduler when using 25 MPI processes.
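The round-robin idea mentioned in the abstract can be illustrated with a minimal sketch. This is not ELBA's implementation, and the abstract does not describe how the three schedulers work internally. The function names `assign_ranks_round_robin` and `ranks_per_gpu` are hypothetical, and the GPU count of 4 in the usage note is an assumed value chosen only for illustration.

```python
from collections import defaultdict


def assign_ranks_round_robin(num_ranks, num_gpus):
    """Map each MPI rank to a GPU index in round-robin order.

    Hypothetical helper: rank r is assigned to GPU (r mod num_gpus).
    """
    if num_ranks < 0:
        raise ValueError("num_ranks must be non-negative")
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return {rank: rank % num_gpus for rank in range(num_ranks)}


def ranks_per_gpu(assignment):
    """Group ranks by the GPU they were assigned to."""
    groups = defaultdict(list)
    for rank, gpu in sorted(assignment.items()):
        groups[gpu].append(rank)
    return dict(groups)


if __name__ == "__main__":
    # 25 MPI processes as in the abstract; 4 GPUs is an assumption.
    mapping = assign_ranks_round_robin(num_ranks=25, num_gpus=4)
    for gpu, ranks in ranks_per_gpu(mapping).items():
        print(f"GPU {gpu}: ranks {ranks}")
```

With 25 ranks and an assumed 4 GPUs, the ranks are spread almost evenly, so one GPU serves 7 ranks and the others serve 6. The more ranks share a GPU, the more they have to take turns on it, which is one plausible reading of the trade-off between parallelism and scheduler overhead noted in the abstract.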