AI kernel compilation for edge devices depends on the compiler's ability to exploit parallelism and hide memory latency in the presence of hierarchical memory and explicit data movement. This paper reports a benchmark methodology and corresponding results for three compiler-controlled mechanisms in an MLIR-based compilation pipeline: vectorization (Vec), multi-threading (MT) across hardware contexts, and double buffering (DB) using ping--pong scratchpad buffers to overlap DMA transfers with compute. Using Triton/Inductor-generated kernels, we present an ablation ladder that separates the contribution of Vec, MT, and DB, and we quantify how MT speedup scales with problem size using GELU as a representative activation kernel. The results show that vectorization provides the primary gain for bandwidth-sensitive kernels, MT delivers substantial improvements once scheduling overhead is amortized, and DB provides additional benefit when transfers and compute can be overlapped (i.e., outside the extremes of purely memory-bound or purely compute-bound behavior).