The Versal Adaptive Compute Acceleration Platform (ACAP) is a new architecture that combines AI Engines (AIEs) with reconfigurable fabric. It offers significant acceleration potential for uniform recurrences, which arise in domains such as deep learning, high-performance computing, and signal processing. However, mapping these computations onto the Versal ACAP efficiently while keeping AIE utilization high is challenging. To address this, we propose \fname, a mapping scheme that accelerates uniform recurrences on the Versal ACAP by exploiting features of both the hardware and the computations. To match the array structure of the AIEs, our approach applies space-time transformations based on the polyhedral model to generate legal, optimized systolic array mappings. We also develop a routing-aware PLIO assignment algorithm tailored to communication on the AIE array; it aims for successful compilation while maximizing array utilization. Furthermore, we introduce an automatic mapping framework that generates the executable code for uniform recurrences, comprising the AIE kernel program, programmable logic bitstreams, and the host program. Experimental results validate the effectiveness of our mapping scheme. Specifically, for matrix multiplication on the VCK5000 board, we achieve a throughput of 4.15 TOPS with the float data type, which is 1.11$\times$ higher than the state-of-the-art accelerator on the Versal ACAP architecture.
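To make the idea of a legal space-time transformation concrete, the following is a minimal sketch for matrix multiplication written as a uniform recurrence over $(i, j, k)$. It is not the \fname implementation: the dependence vectors, the function names (`is_legal_systolic`, `place`), and the two example matrices are illustrative assumptions, and a real flow would rely on a polyhedral library rather than hand-written checks. The sketch treats a transformation as legal when it is nonsingular, every dependence advances time by at least one step, and data moves at most one processing element (PE) per axis.

```python
from itertools import product

# Assumed dependence vectors for C[i][j] += A[i][k] * B[k][j] over (i, j, k):
# A is reused along j, B is reused along i, and C accumulates along k.
DEPS = {"A": (0, 1, 0), "B": (1, 0, 0), "C": (0, 0, 1)}

# Hypothetical candidates: rows 0-1 give the PE coordinates, row 2 the time step.
OUTPUT_STATIONARY = ((1, 0, 0), (0, 1, 0), (1, 1, 1))  # time = i + j + k
NO_SKEW = ((1, 0, 0), (0, 1, 0), (0, 0, 1))            # time = k


def det3(m):
    """Determinant of a 3x3 integer matrix given as nested tuples."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def apply(T, v):
    """Matrix-vector product T @ v."""
    return tuple(sum(t * x for t, x in zip(row, v)) for row in T)


def is_legal_systolic(T, deps=DEPS):
    """Check the three legality conditions stated in the lead-in."""
    if det3(T) == 0:  # singular: distinct iterations could share a PE and step
        return False
    for d in deps.values():
        x, y, t = apply(T, d)
        if t <= 0:  # the consumer must run strictly after the producer
            return False
        if abs(x) > 1 or abs(y) > 1:  # move at most one PE per axis
            return False
    return True


def place(T, n):
    """Map every iteration of an n x n x n product to ((pe_x, pe_y), time)."""
    slots = {}
    for p in product(range(n), repeat=3):
        x, y, t = apply(T, p)
        slots[p] = ((x, y), t)
    return slots
```

Under these assumptions, `is_legal_systolic(OUTPUT_STATIONARY)` holds, whereas `NO_SKEW` is rejected because the A and B dependences would have to cross PEs within the same time step. Calling `place(OUTPUT_STATIONARY, 3)` assigns each of the 27 iterations a distinct PE and time slot.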