Spiking neural operators are appealing for neuromorphic edge computing because event-driven substrates can, in principle, translate sparse activity into lower latency and energy. Whether that advantage survives deployment on commodity edge-GPU software stacks, however, remains unclear. We study this question on a Jetson Orin Nano 8 GB using five pretrained variable-spiking wavelet neural operator (VS-WNO) checkpoints and five matched dense wavelet neural operator (WNO) checkpoints on the Darcy rectangular benchmark. On a reference-aligned path, VS-WNO exhibits substantial algorithmic sparsity, with mean spike rates decreasing from 54.26% at the first spiking layer to 18.15% at the fourth. On a deployment-style request path, however, this sparsity does not reduce deployed cost: VS-WNO reaches 59.6 ms latency and 228.0 mJ dynamic energy per inference, whereas dense WNO reaches 53.2 ms and 180.7 mJ while also achieving slightly lower reference-path error (1.77% versus 1.81% for VS-WNO). Nsight Systems indicates that the request path remains launch-dominated and dense rather than sparsity-aware: for VS-WNO, cudaLaunchKernel accounts for 81.6% of CUDA API time within the latency window, and dense convolution kernels account for 53.8% of GPU kernel time; dense WNO shows the same pattern. On this Jetson-class GPU stack, spike sparsity is therefore measurable but yields no latency or energy benefit, because the runtime does not suppress dense work as spike activity decreases.
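The request-path figures above can be turned into relative overheads with a few lines of arithmetic. The following is a minimal sketch, not part of the study's tooling: the helper names `relative_overhead` and `mean_dynamic_power_w` are hypothetical, and the power estimate assumes that the reported dynamic energy and latency refer to the same measurement window, which the abstract does not state.

```python
# Illustrative arithmetic on the per-inference figures quoted in the abstract.
# Helper names are hypothetical; they are not taken from the study's code.

def relative_overhead(value: float, baseline: float) -> float:
    """Fractional increase of `value` over `baseline` (0.10 means +10%)."""
    return (value - baseline) / baseline


def mean_dynamic_power_w(energy_mj: float, latency_ms: float) -> float:
    """Average dynamic power in watts; mJ divided by ms equals W.

    Assumes energy and latency cover the same window (an assumption here).
    """
    return energy_mj / latency_ms


# Figures reported for the deployment-style request path.
vs_wno = {"latency_ms": 59.6, "energy_mj": 228.0}
wno = {"latency_ms": 53.2, "energy_mj": 180.7}

latency_overhead = relative_overhead(vs_wno["latency_ms"], wno["latency_ms"])
energy_overhead = relative_overhead(vs_wno["energy_mj"], wno["energy_mj"])
vs_wno_power = mean_dynamic_power_w(vs_wno["energy_mj"], vs_wno["latency_ms"])
wno_power = mean_dynamic_power_w(wno["energy_mj"], wno["latency_ms"])

print(f"VS-WNO latency overhead vs. dense WNO: {latency_overhead:.1%}")
print(f"VS-WNO energy overhead vs. dense WNO:  {energy_overhead:.1%}")
print(f"Implied mean dynamic power: VS-WNO {vs_wno_power:.2f} W, WNO {wno_power:.2f} W")
```

Under these assumptions, the spiking model is roughly 12% slower and draws roughly 26% more dynamic energy per inference than the dense baseline, so the energy gap is wider than the latency gap alone would explain.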