Dedicated accelerator hardware has become essential for processing AI-based workloads, leading to the rise of novel accelerator architectures. Furthermore, fundamental differences in memory architecture and parallelism have made these accelerators attractive targets for scientific computing. Sequence alignment is a fundamental problem in bioinformatics. We have implemented the $X$-Drop algorithm, a heuristic method for pairwise alignment that reduces the search space, on the Graphcore Intelligence Processing Unit (IPU) accelerator. The $X$-Drop algorithm has an irregular computational pattern, which makes it difficult to accelerate because of load imbalance. Here, we introduce a graph-based partitioning scheme and a queue-based batch system to improve load balancing. Our implementation achieves a $10\times$ speedup over a state-of-the-art GPU implementation and up to $4.65\times$ over a CPU implementation. In addition, we introduce a memory-restricted $X$-Drop algorithm that reduces the memory footprint by $55\times$ and efficiently uses the IPU's limited low-latency SRAM. This optimization further improves strong-scaling performance by $3.6\times$.
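To make the $X$-Drop heuristic mentioned above concrete, the following is a minimal Python sketch of a generic $X$-Drop extension, not the IPU implementation or the memory-restricted variant from this work. It assumes a simple linear scoring scheme; the function name `xdrop_extend` and the parameters `match`, `mismatch`, and `gap` are hypothetical choices for illustration. A cell of the dynamic-programming table is discarded once its score falls more than `x` below the best score seen so far, so the set of live cells per row varies with the input, which is the source of the irregular computational pattern.

```python
NEG_INF = float("-inf")


def xdrop_extend(a, b, x, match=1, mismatch=-1, gap=-1):
    """Best score of an alignment of a prefix of `a` with a prefix of `b`.

    Illustrative sketch only: cells scoring below (best - x) are pruned,
    and the extension stops as soon as a row has no live cell left.
    """
    best = 0
    # Row 0: the empty prefix of `a` against prefixes of `b` (gaps only).
    prev = [NEG_INF] * (len(b) + 1)
    lo, hi = 0, 0
    for j in range(len(b) + 1):
        score = j * gap
        if score < best - x:
            break
        prev[j] = score
        hi = j

    for i in range(1, len(a) + 1):
        cur = [NEG_INF] * (len(b) + 1)
        row_best = NEG_INF
        new_lo = new_hi = None
        for j in range(lo, len(b) + 1):
            # Beyond the previous row's live window, only the left
            # neighbour can feed this cell; stop once it is dead too.
            if j > hi + 1 and cur[j - 1] == NEG_INF:
                break
            diag = left = NEG_INF
            if j > 0:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                diag = prev[j - 1] + sub
                left = cur[j - 1] + gap
            up = prev[j] + gap
            score = max(diag, up, left)
            if score < best - x:
                continue  # pruned: the cell stays at NEG_INF
            cur[j] = score
            row_best = max(row_best, score)
            if new_lo is None:
                new_lo = j
            new_hi = j
        if new_lo is None:
            break  # every cell in this row dropped out: terminate early
        best = max(best, row_best)
        prev, lo, hi = cur, new_lo, new_hi
    return best
```

For example, `xdrop_extend("ACGT", "ACGT", 2)` follows the main diagonal and scores 4, while two sequences with no matching characters terminate after a few rows with a score of 0. A smaller `x` prunes more aggressively and shrinks the live window, at the cost of possibly missing alignments that recover after a low-scoring region.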