Transformer models have set new performance standards for machine learning (ML) tasks. However, deploying these resource-intensive models on resource-constrained edge devices for cloud-free, on-chip inference remains challenging. The ARM Compute Library (ARM-CL) framework provides low-latency CNN inference on ARM-based edge devices but lacks support for transformer inference. In this work, we implement several new transformer kernels in ARM-CL to support native transformer execution. Our extended ARM-CL achieves up to three times faster transformer inference than state-of-the-art CPU/GPU implementations on an ARM-based embedded board. Furthermore, the heterogeneous multi-processor systems-on-chip (HMPSoCs) that power edge devices provide both embedded CPUs and GPUs. We introduce cooperative CPU-GPU transformer inference, which executes memory-intensive operations on the CPU while using the GPU for highly parallelizable, compute-intensive operations. This cooperative execution, implemented with minimal overhead, further reduces transformer inference latency by up to 15.72% compared to the best single-processor inference on ARM-CL.
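To make the cooperative split concrete, the following is a minimal sketch, not the ARM-CL implementation. It assumes that "memory-intensive" versus "compute-intensive" can be approximated by arithmetic intensity (FLOPs per byte moved). The `Op` type, the `assign_processor` helper, the threshold of 10 FLOPs/byte, and the per-operation FLOP and byte counts are all hypothetical values chosen for illustration; the resulting CPU/GPU mapping comes from these toy numbers and is not a result reported in this work. The last helper only converts a latency reduction percentage into the equivalent speedup factor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """A hypothetical transformer operation with rough cost estimates."""
    name: str
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes read plus bytes written


def arithmetic_intensity(op: Op) -> float:
    """FLOPs per byte moved; low values suggest a memory-bound operation."""
    return op.flops / op.bytes_moved


def assign_processor(op: Op, threshold: float = 10.0) -> str:
    """Assumed heuristic: compute-intensive ops go to the GPU, the rest to the CPU."""
    return "GPU" if arithmetic_intensity(op) >= threshold else "CPU"


def latency_reduction_to_speedup(reduction_pct: float) -> float:
    """Convert a latency reduction in percent to a speedup factor."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


# Illustrative numbers only (sequence length 128, hidden size 768, fp32).
ops = [
    Op("qkv_matmul", flops=2 * 128 * 768 * 2304,
       bytes_moved=4 * (128 * 768 + 768 * 2304 + 128 * 2304)),
    Op("softmax", flops=5 * 12 * 128 * 128,
       bytes_moved=4 * 2 * 12 * 128 * 128),
    Op("layer_norm", flops=8 * 128 * 768,
       bytes_moved=4 * 2 * 128 * 768),
]

placement = {op.name: assign_processor(op) for op in ops}

if __name__ == "__main__":
    for op in ops:
        print(f"{op.name}: {arithmetic_intensity(op):.2f} FLOPs/byte -> {placement[op.name]}")
    print(f"15.72% lower latency is about a {latency_reduction_to_speedup(15.72):.3f}x speedup")
```

Under these assumed costs, the matrix multiplication lands on the GPU while softmax and layer normalization stay on the CPU. A real scheduler would also have to account for CPU-GPU synchronization and data-transfer overhead, which this sketch ignores.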