We design and implement parallel prefix sum (scan) algorithms for Ascend AI accelerators. Ascend accelerators feature two kinds of specialized computing units: cube units for matrix multiplication and vector units for vector operations. The key feature of the proposed scan algorithms is that they make extensive use of the matrix multiplications and accumulations provided by the cube unit. To demonstrate the effectiveness of these algorithms, we also implement and evaluate several scan-based operators commonly used in AI workloads, including sorting, tensor masking, and top-$k$ / top-$p$ sampling. On a single core, our scan algorithms achieve speedups ranging from $5\times$ to $9.6\times$ over vector-only implementations for sufficiently large input lengths. We also present a multi-core scan algorithm that uses both the cube and vector units of Ascend and reaches up to 74.9\% of the memory bandwidth achieved by memory copy. Finally, our radix sort implementation, which uses matrix multiplications for its parallel splits, shows that matrix engines can accelerate more complex operations as well, offering up to $3.3\times$ speedup over the vector-only baseline.
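To make the idea of a matrix-multiplication-based scan concrete, the following is a minimal sketch in plain Python, not the implementation evaluated here. It assumes the well-known formulation in which the input is laid out as row-major $s \times s$ tiles and each tile is multiplied by an upper-triangular matrix of ones, which yields the prefix sums of every row; the row totals are then propagated sequentially, standing in for work that a vector unit or a second multiplication could do. The names `matmul`, `tile_scan`, `split_by_bit`, and `radix_sort` are hypothetical, and nested lists stand in for the cube unit.

```python
def matmul(a, b):
    """Dense product of two matrices given as lists of rows (stand-in for a matrix engine)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def tile_scan(values, s):
    """Inclusive prefix sum of `values`, computed on s x s tiles (assumed layout)."""
    n = len(values)
    # upper[i][j] = 1 when i <= j, so (A @ upper)[r][j] = A[r][0] + ... + A[r][j].
    upper = [[1 if i <= j else 0 for j in range(s)] for i in range(s)]
    out = []
    carry = 0  # sum of everything that precedes the current row
    for start in range(0, n, s * s):
        chunk = values[start:start + s * s]
        chunk = chunk + [0] * (s * s - len(chunk))  # zero-pad the last tile
        tile = [chunk[r * s:(r + 1) * s] for r in range(s)]
        for row in matmul(tile, upper):  # each row is now a local scan
            out.extend(carry + v for v in row)
            carry += row[-1]  # last entry of a row scan is the row total
    return out[:n]


def split_by_bit(keys, bit):
    """Stable split: keys with `bit` cleared come first, then keys with `bit` set."""
    flags = [1 - ((k >> bit) & 1) for k in keys]  # 1 marks a cleared bit
    ranks = tile_scan(flags, 4)  # tile size 4 is an arbitrary choice
    zeros = ranks[-1] if ranks else 0
    out = [None] * len(keys)
    for i, k in enumerate(keys):
        if flags[i]:
            out[ranks[i] - 1] = k
        else:
            out[zeros + i - ranks[i]] = k  # i - ranks[i] set-bit keys precede k
    return out


def radix_sort(keys, bits=8):
    """Least-significant-bit-first radix sort for non-negative keys below 2**bits."""
    for bit in range(bits):
        keys = split_by_bit(keys, bit)
    return keys
```

For example, `tile_scan(list(range(1, 11)), 2)` processes three $2 \times 2$ tiles, the last one zero-padded, and `radix_sort` applies one scan-based split per key bit. The sketch only illustrates why a scan reduces to multiplications by a constant triangular matrix; it says nothing about tile sizes, data types, or scheduling on real hardware.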