Video analysis is a major computer vision task that has received considerable attention in recent years. State-of-the-art performance in video analysis is currently achieved with Deep Neural Networks (DNNs), which have high computational costs and need large amounts of labeled data for training. When implemented on neuromorphic hardware, Spiking Neural Networks (SNNs) have computational costs that are thousands of times lower than those of conventional non-spiking networks. SNNs have been used for video analysis with methods such as 3D Convolutional Spiking Neural Networks (3D CSNNs). However, these networks have significantly more parameters than 2D CSNNs. This not only increases the computational costs, but also makes these networks more difficult to implement on neuromorphic hardware. In this work, we use CSNNs trained in an unsupervised manner with the Spike Timing-Dependent Plasticity (STDP) rule, and we introduce, for the first time, Spiking Separated Spatial and Temporal Convolutions (S3TCs) to reduce the number of parameters required for video analysis. Unsupervised learning has the advantage of not needing large amounts of labeled data for training. Factorizing a single spatio-temporal spiking convolution into a spatial spiking convolution and a temporal spiking convolution decreases the number of parameters of the network. We evaluate our network on the KTH, Weizmann, and IXMAS datasets, and we show that S3TCs successfully extract spatio-temporal information from videos while increasing the output spiking activity and outperforming spiking 3D convolutions.
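To illustrate why separating a spatio-temporal convolution into a spatial and a temporal convolution can reduce the parameter count, the following minimal sketch counts the weights of both variants. The kernel sizes, the channel counts, and the intermediate channel count `c_mid` are illustrative assumptions and are not taken from the network described here; the helper names are hypothetical.

```python
# Illustrative weight count only; all sizes below are assumptions, not the paper's settings.

def conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weights of one full spatio-temporal convolution (k_t x k_h x k_w), no bias."""
    return c_in * c_out * k_t * k_h * k_w


def separated_params(c_in, c_mid, c_out, k_t, k_h, k_w):
    """Weights of a spatial convolution (1 x k_h x k_w) followed by a
    temporal convolution (k_t x 1 x 1), no bias."""
    spatial = c_in * c_mid * k_h * k_w   # spatial filters: c_in -> c_mid
    temporal = c_mid * c_out * k_t       # temporal filters: c_mid -> c_out
    return spatial + temporal


# Assumed layer: 64 input maps, 64 output maps, 3 frames deep, 5 x 5 pixels.
full = conv3d_params(c_in=64, c_out=64, k_t=3, k_h=5, k_w=5)
sep = separated_params(c_in=64, c_mid=64, c_out=64, k_t=3, k_h=5, k_w=5)

print(f"3D convolution: {full} weights")
print(f"separated:      {sep} weights")
print(f"reduction:      {full / sep:.2f}x")
```

Under these assumed sizes the separated variant needs fewer weights than the full 3D convolution. The size of the saving depends on the intermediate channel count: a very small number of input maps combined with a large `c_mid` can cancel it.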