Spiking Neural Networks (SNNs), with their temporal processing capabilities and biologically plausible dynamics, offer a natural platform for unsupervised representation learning. However, current unsupervised SNNs predominantly employ shallow architectures or localized plasticity rules, which limits their ability to model long-range temporal dependencies and to maintain temporal feature consistency. The resulting representations are semantically unstable, which impedes the development of deep unsupervised SNNs for large-scale temporal video data. We propose PredNext, which explicitly models temporal relationships through cross-view future Step Prediction and Clip Prediction. This plug-and-play module integrates with diverse self-supervised objectives. We establish the first standard benchmarks for SNN self-supervised learning on UCF101, HMDB51, and MiniKinetics, which are substantially larger than conventional DVS datasets. PredNext delivers significant performance improvements across different tasks and self-supervised methods. With unsupervised training solely on UCF101, PredNext achieves performance comparable to that of ImageNet-pretrained supervised weights. Additional experiments demonstrate that PredNext, unlike forced consistency constraints, substantially improves temporal feature consistency while enhancing network generalization. This work provides an effective foundation for unsupervised deep SNNs on large-scale temporal video data.