Video understanding is a core problem in computer vision with numerous applications. As mobile devices become more widespread, a growing body of work aims to deploy video understanding models on them. However, existing video understanding models are difficult to deploy because of their large size and prohibitive power consumption. Spiking Neural Networks (SNNs) offer biological plausibility and lower power consumption than Artificial Neural Networks (ANNs), especially on neuromorphic chips, which are regarded as essential components of future mobile devices. However, excessively long conversion time-steps and severe performance degradation limit the application of SNNs. To address these problems, we explore the application of SNNs to temporal action detection (TAD), an important task in video understanding, and propose the first SNN-based end-to-end TAD architecture, named SpikeTAD. While maintaining extremely low power consumption, SpikeTAD achieves an average mAP of 67.2% on THUMOS14 and 37.42% on ActivityNet-1.3, demonstrating the feasibility of a low-power TAD model. Our code is available at https://github.com/MCG-NJU/SpikeTAD.