Video analysis is a computer vision task with applications in surveillance, human-machine interaction, and autonomous vehicles. Deep Convolutional Neural Networks (CNNs) are currently the state-of-the-art methods for video analysis. However, they have high computational costs and need a large amount of labeled data for training. In this paper, we use Convolutional Spiking Neural Networks (CSNNs) trained with the unsupervised Spike Timing-Dependent Plasticity (STDP) learning rule for action classification. These networks represent information using asynchronous low-energy spikes, which makes them more energy efficient and better suited to neuromorphic hardware. However, the behaviour of CSNNs with spatio-temporal computer vision models has not been studied sufficiently. Therefore, we explore transposing two-stream neural networks into the spiking domain. Implementing this model with unsupervised STDP-based CSNNs allows us to further study the performance of these networks on video analysis. In this work, we show that two-stream CSNNs can successfully extract spatio-temporal information from videos despite using limited training data, and that the spiking spatial and temporal streams are complementary. We also show that using a spatio-temporal stream within a spiking STDP-based two-stream architecture leads to information redundancy and does not improve performance.
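For readers unfamiliar with STDP, the following is a minimal sketch of a generic pair-based rule with exponential windows. It is a textbook form given for illustration only and is not necessarily the STDP variant used in this paper. The function names and all parameter values (`a_plus`, `a_minus`, `tau_plus`, `tau_minus`, the weight bounds) are assumptions chosen for the example.

```python
import math


def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012,
                 tau_plus=20.0, tau_minus=20.0):
    """Weight change for a single pre/post spike pair (times in ms).

    Illustrative pair-based STDP; all constants are assumed values.
    """
    dt = t_post - t_pre
    if dt >= 0:
        # Presynaptic spike precedes (or coincides with) the postsynaptic
        # spike: potentiate, more strongly the closer the two spikes are.
        return a_plus * math.exp(-dt / tau_plus)
    # Postsynaptic spike precedes the presynaptic spike: depress.
    return -a_minus * math.exp(dt / tau_minus)


def apply_stdp(w, t_pre, t_post, w_min=0.0, w_max=1.0):
    """Apply one STDP update and clip the weight to [w_min, w_max]."""
    return min(w_max, max(w_min, w + stdp_delta_w(t_pre, t_post)))


if __name__ == "__main__":
    w = 0.5
    w = apply_stdp(w, t_pre=10.0, t_post=15.0)  # causal pair: weight increases
    w = apply_stdp(w, t_pre=30.0, t_post=22.0)  # anti-causal pair: weight decreases
    print(w)
```

The update depends only on local spike times and needs no labels, which is why rules of this kind are described as unsupervised.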