SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition

Event camera-based pattern recognition has emerged as a research topic in recent years. Existing approaches usually transform the event streams into images, graphs, or voxels and adopt deep neural networks for event-based classification. Although good performance can be achieved on simple event recognition datasets, their results may still be limited by the following two issues. First, they use only spatially sparse event streams for recognition, which may fail to capture color and detailed texture information well. Second, they adopt either Spiking Neural Networks (SNN) for energy-efficient recognition with suboptimal results, or Artificial Neural Networks (ANN) for energy-intensive, high-performance recognition; few of them consider striking a balance between these two aspects. In this paper, we formally propose to recognize patterns by fusing RGB frames and event streams simultaneously, and we introduce a new RGB frame-event recognition framework to address the aforementioned issues. The proposed method contains four main modules: a memory support Transformer network for RGB frame encoding, a spiking neural network for raw event stream encoding, a multi-modal bottleneck fusion module for RGB-Event feature aggregation, and a prediction head. Due to the scarcity of RGB-Event based classification datasets, we also propose a large-scale PokerEvent dataset, which contains 114 classes and 27102 frame-event pairs recorded using a DVS346 event camera. Extensive experiments on two RGB-Event based classification datasets validate the effectiveness of our proposed framework. We hope this work will boost the development of pattern recognition by fusing RGB frames and event streams. Both the dataset and the source code of this work will be released at https://github.com/Event-AHU/SSTFormer.
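To make the energy-efficiency argument for the spiking branch more concrete, the following is a minimal sketch of how a single spiking neuron turns a sequence of input values into sparse binary spikes. It assumes a leaky integrate-and-fire (LIF) neuron with a hard reset; the abstract does not state which neuron model SSTFormer uses, and the function name `lif_encode` and the parameters `tau`, `v_threshold`, and `v_reset` are hypothetical names chosen for illustration only.

```python
def lif_encode(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate one leaky integrate-and-fire neuron (an assumed model,
    not necessarily the one used in SSTFormer).

    inputs: sequence of input currents, one per time step.
    Returns a list of binary spikes (0 or 1), one per time step.
    """
    v = v_reset  # membrane potential starts at the reset value
    spikes = []
    for x in inputs:
        # Leaky integration: the potential moves toward the input,
        # with the time constant tau controlling how fast.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)  # threshold crossed: emit a spike
            v = v_reset       # hard reset after firing
        else:
            spikes.append(0)  # below threshold: stay silent
    return spikes


if __name__ == "__main__":
    # A constant input of 1.5 needs two steps to reach the threshold,
    # so the neuron fires on every second step.
    print(lif_encode([1.5, 1.5, 1.5, 1.5]))  # prints [0, 1, 0, 1]
    # With no input the neuron never fires.
    print(lif_encode([0.0, 0.0, 0.0]))       # prints [0, 0, 0]
```

Because the output is binary and mostly zero for weak inputs, downstream layers only need to do work when a spike occurs, which is the usual intuition behind the lower energy cost of SNNs compared with ANNs.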

