We present a QPU-first ML runtime stack for the VideoCore VII QPU of the Raspberry Pi 5, built on top of the py-videocore7 assembly library. The system comprises a reusable tiled matrix-multiplication substrate, GEMM-backed convolution, a single-head attention-style core, persistent executors, and integer execution based on smul24 instructions. For dense integer kernels, packed INT16 input with INT32 accumulation achieves nearly two orders of magnitude higher throughput than NumPy. Across operations (min/max, pooling, convolution, attention), we report improved performance over both PyTorch and NumPy. Our preliminary results indicate that Raspberry Pi QPUs can serve as a practical execution substrate for accelerating AI model execution at the edge.
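As an illustration of the data flow named above (tiled matrix multiplication, INT16 inputs with INT32 accumulation, and convolution lowered to GEMM), the following is a minimal CPU-side reference sketch in NumPy. It is not the QPU implementation: it does not use py-videocore7 or model smul24, and the function names, the `tile` parameter, and the tensor layouts are hypothetical choices made for this example only.

```python
import numpy as np

def tiled_matmul_i16_i32(a, b, tile=16):
    """Reference tiled GEMM: INT16 inputs, INT32 accumulation (hypothetical helper)."""
    assert a.dtype == np.int16 and b.dtype == np.int16
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.int32)
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            # One output tile is accumulated in INT32 across all K-tiles.
            acc = np.zeros((min(tile, m - i0), min(tile, n - j0)), dtype=np.int32)
            for k0 in range(0, k, tile):
                # Widen the INT16 operands before multiplying so that
                # individual products are formed in INT32.
                a_blk = a[i0:i0 + tile, k0:k0 + tile].astype(np.int32)
                b_blk = b[k0:k0 + tile, j0:j0 + tile].astype(np.int32)
                acc += a_blk @ b_blk
            c[i0:i0 + tile, j0:j0 + tile] = acc
    return c

def conv2d_via_gemm(x, w, tile=16):
    """Stride-1, unpadded cross-correlation lowered to GEMM via im2col.

    Assumed layouts: x is (C, H, W) int16, w is (F, C, KH, KW) int16.
    Returns an (F, OH, OW) int32 array.
    """
    c, h, wd = x.shape
    f, c2, kh, kw = w.shape
    assert c == c2, "channel counts must match"
    oh, ow = h - kh + 1, wd - kw + 1
    # Each row of `cols` holds one (channel, dy, dx) tap for every output pixel.
    cols = np.empty((c * kh * kw, oh * ow), dtype=np.int16)
    row = 0
    for ci in range(c):
        for dy in range(kh):
            for dx in range(kw):
                cols[row] = x[ci, dy:dy + oh, dx:dx + ow].reshape(-1)
                row += 1
    # Flattening w in (C, KH, KW) order matches the row order of `cols`.
    out = tiled_matmul_i16_i32(w.reshape(f, -1), cols, tile)
    return out.reshape(f, oh, ow)
```

Note that INT32 accumulation can overflow when the reduction dimension is large and the inputs use most of the INT16 range, so this sketch assumes inputs scaled to leave enough headroom.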