As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, inter-core communication has recently been enabled by high-bandwidth, low-latency on-chip interconnect links (e.g., the Graphcore IPU). This allows each core to directly access the fast scratchpad memory of other cores, enabling new parallel computing paradigms. However, without proper support for scalable inter-core connections in current DL compilers, it is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory of AI chips. To formulate the computation and communication patterns of tensor operators on this new architecture, T10 introduces a distributed tensor abstraction, rTensor. T10 maps a DNN model to execution plans following a generalized compute-shift pattern, partitioning the DNN computation into sub-operators and mapping them to cores so that the cores can exchange data in predictable patterns. T10 makes globally optimized trade-offs between on-chip memory consumption and inter-core communication overhead, selects the best execution plan from a vast optimization space, and eliminates unnecessary inter-core communication. Our evaluation on a real inter-core connected AI chip, the Graphcore IPU, shows up to 3.3$\times$ performance improvement and scalability to larger models, compared with state-of-the-art DL compilers and vendor libraries.