Diffusion inference remains costly for edge deployment, yet existing accelerators focus almost exclusively on score networks because the standard drift term is merely a trivial linear scaling. Kuramoto orientation diffusion replaces this trivial drift with locally coupled phase interactions, improving sampling efficiency but introducing a new hardware bottleneck: a center-dependent nonlinear 5 × 5 stencil evaluated at every reverse step. This kernel maps poorly to conventional CNN accelerators and matrix-oriented engines. We present SA-Kura, to our knowledge the first digital systolic-array accelerator dedicated to locally coupled Kuramoto drift. By reformulating pairwise sinusoidal coupling into a neighbor accumulation that is independent of the center phase, followed by a single center-dependent multiply-subtract combination, SA-Kura eliminates in-PE transcendental units and enables regular systolic execution with register-level reuse. SA-Kura was implemented in synthesizable RTL, integrated into a lightweight RISC-V-based SoC, prototyped on FPGA, and evaluated through 45 nm CMOS synthesis and power analysis. For the drift kernel alone, compared with software execution of the same kernel on the processor core of the same SoC platform, SA-Kura reduces latency by 193x and energy by 69.4x. Compared with a standalone Jetson Orin Nano CUDA implementation of the same kernel, it is 6.57x faster and consumes approximately 46.0x less energy per pixel.
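The reformulation described above can be illustrated with the angle-difference identity sin(θ_j − θ_i) = sin θ_j cos θ_i − cos θ_j sin θ_i, which lets the two neighbor sums be accumulated without reference to the center phase. The following is a minimal software sketch of that idea, not the SA-Kura datapath: the function names, the uniform stencil weights, and the periodic boundary handling are all assumptions made for illustration, and the actual drift in the paper may include further terms.

```python
import math

R = 2  # stencil radius: a 5 x 5 window


def drift_direct(theta, weights):
    """Reference form: sum_j w_j * sin(theta_j - theta_i), one sin per pair."""
    h, w = len(theta), len(theta[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-R, R + 1):
                for dj in range(-R, R + 1):
                    # Periodic boundary is an assumption of this sketch.
                    ni, nj = (i + di) % h, (j + dj) % w
                    acc += weights[di + R][dj + R] * math.sin(theta[ni][nj] - theta[i][j])
            out[i][j] = acc
    return out


def drift_factored(theta, weights):
    """Factored form: center-independent accumulation, then one multiply-subtract."""
    h, w = len(theta), len(theta[0])
    # sin/cos are evaluated once per pixel, outside the stencil loop.
    sin_t = [[math.sin(v) for v in row] for row in theta]
    cos_t = [[math.cos(v) for v in row] for row in theta]
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s_acc = 0.0  # sum_j w_j * sin(theta_j)
            c_acc = 0.0  # sum_j w_j * cos(theta_j)
            for di in range(-R, R + 1):
                for dj in range(-R, R + 1):
                    ni, nj = (i + di) % h, (j + dj) % w
                    wgt = weights[di + R][dj + R]
                    s_acc += wgt * sin_t[ni][nj]
                    c_acc += wgt * cos_t[ni][nj]
            # Single center-dependent combination.
            out[i][j] = cos_t[i][j] * s_acc - sin_t[i][j] * c_acc
    return out


# Small deterministic test grid and uniform weights (both hypothetical).
theta = [[0.37 * i - 0.21 * j + 0.05 * i * j for j in range(8)] for i in range(8)]
weights = [[1.0 / 25.0] * 5 for _ in range(5)]

a = drift_direct(theta, weights)
b = drift_factored(theta, weights)
max_err = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
print("max abs difference:", max_err)
```

The two functions should agree to within floating-point rounding. In the factored form the inner loop contains only multiply-accumulate operations on precomputed values, which is the property that makes a regular systolic mapping plausible.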