As custom hardware accelerators become more prevalent, it becomes increasingly important to automatically generate efficient host-driver code that can fully leverage their capabilities. Automatic generation saves time and reduces the errors that manual implementation tends to introduce. AXI4MLIR extends the MLIR compiler framework to generate host-driver code for custom accelerators targeting linear algebra problems. Specific compiler optimizations can further increase accelerator utilization. In this work, we offer two key observations from a MatMul accelerator case study. First, the accelerator's compute core utilization is less than 10%. Second, the critical latency bottleneck is caused by copying data between the heap and memory-mapped DMA buffers. We identify a set of missing host-code optimizations that address both the under-utilization and the latency bottleneck, and we propose three key host-code data-movement optimizations that extend AXI4MLIR: DMA-based data allocation, coalescing of DMA transfers, and pipelining of the accelerator's load, compute, and store stages.
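To make the pipelining argument concrete, the following is a minimal back-of-the-envelope latency model, not the AXI4MLIR implementation. It assumes every tile has the same fixed load, compute, and store times, that the three stages use independent resources, and that enough buffers exist to overlap them. The function names and the stage times are hypothetical and are not measurements from the case study.

```python
def sequential_latency(n_tiles, load, compute, store):
    """Each tile finishes load, compute, and store before the next tile starts."""
    return n_tiles * (load + compute + store)


def pipelined_latency(n_tiles, load, compute, store):
    """Three overlapped stages: the first tile fills the pipeline, then one
    tile completes per interval of the slowest stage."""
    if n_tiles == 0:
        return 0
    return (load + compute + store) + (n_tiles - 1) * max(load, compute, store)


def compute_utilization(n_tiles, compute, total_latency):
    """Fraction of the total time during which the compute core is busy."""
    return n_tiles * compute / total_latency if total_latency else 0.0


if __name__ == "__main__":
    # Hypothetical stage times in arbitrary units, chosen so that data
    # movement dominates compute.
    n, load, compute, store = 16, 6, 1, 4
    seq = sequential_latency(n, load, compute, store)
    pipe = pipelined_latency(n, load, compute, store)
    print(f"sequential: {seq} units, utilization {compute_utilization(n, compute, seq):.3f}")
    print(f"pipelined:  {pipe} units, utilization {compute_utilization(n, compute, pipe):.3f}")
```

Under this model, pipelining alone cannot push utilization past the ratio of compute time to the slowest stage. That is consistent with pairing it with optimizations that shorten data movement itself, such as allocating data directly in DMA buffers and coalescing transfers.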