Deep state-space models (SSMs) demonstrate state-of-the-art performance on long-range sequence-modeling tasks. While the recurrent structure of SSMs can be implemented efficiently as a convolution or as a parallel scan during training, recurrent token-by-token processing cannot currently be implemented efficiently on GPUs. Here, we demonstrate efficient token-by-token inference of the SSM S4D on Loihi 2, Intel's state-of-the-art neuromorphic processor. We compare this first neuromorphic-hardware implementation of an SSM, evaluated on sMNIST, psMNIST, and sCIFAR, against recurrent and convolutional implementations of S4D on the Jetson Orin Nano (Jetson). While Jetson performs better in an offline, batched, sample-by-sample processing mode, Loihi 2 outperforms it in token-by-token processing, consuming 1000 times less energy with 75 times lower latency and 75 times higher throughput than the recurrent S4D implementation on Jetson. This opens new avenues toward efficient real-time streaming applications of SSMs.
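The recurrence–convolution duality underlying this work can be sketched in a few lines. The following is a minimal, hypothetical numpy example of a single diagonal-SSM channel in the style of S4D (the state size, sequence length, and randomly drawn parameters are illustrative, not the paper's): the token-by-token recurrence and the convolution with the unrolled kernel produce identical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 16  # illustrative state size and sequence length

# Diagonal (complex) state matrix, S4D-style; |a_n| < 1 for stability
A = 0.9 * np.exp(1j * rng.uniform(0, np.pi, N))
B = rng.standard_normal(N) + 1j * rng.standard_normal(N)
C = rng.standard_normal(N) + 1j * rng.standard_normal(N)
u = rng.standard_normal(L)  # input sequence (one token per step)

# Recurrent mode (token-by-token, as on Loihi 2):
# x_k = A * x_{k-1} + B * u_k,  y_k = Re(C . x_k)
x = np.zeros(N, dtype=complex)
y_rec = np.empty(L)
for k in range(L):
    x = A * x + B * u[k]
    y_rec[k] = (C @ x).real

# Convolutional mode (as used during training):
# unroll the kernel K_l = Re(sum_n C_n A_n^l B_n) and convolve
K = np.array([(C * A**l * B).sum().real for l in range(L)])
y_conv = np.convolve(u, K)[:L]

# Both modes compute the same sequence
assert np.allclose(y_rec, y_conv)
```

The convolutional form parallelizes well over the sequence on GPUs, while the recurrent form needs only the current state per token, which is what makes streaming inference on neuromorphic hardware attractive.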