Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams

Event cameras are neuromorphic vision sensors that represent visual information as sparse, asynchronous event streams. Most state-of-the-art event-based methods project events into dense frames and process them with conventional learning models. However, these approaches sacrifice the sparsity and high temporal resolution of event data, resulting in large model size and high computational complexity. To fit the sparse nature of events and fully explore their implicit relationships, we develop a novel attention-aware framework named Event Voxel Set Transformer (EVSTr) for spatiotemporal representation learning on event streams. It first converts the event stream into a voxel set and then hierarchically aggregates voxel features to obtain robust representations. The core of EVSTr is an event voxel transformer encoder that extracts discriminative spatiotemporal features. The encoder consists of two components: a multi-scale neighbor embedding layer (MNEL) for local information aggregation and a voxel self-attention layer (VSAL) for global representation modeling. To enable the framework to incorporate long-term temporal structure, we introduce a segmental consensus strategy for modeling motion patterns over a sequence of segmented voxel sets. We evaluate the proposed framework on two event-based tasks: object classification and action recognition. Comprehensive experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity. Additionally, we present a new dataset (NeuroHAR), recorded in challenging visual scenarios, to address the lack of real-world event-based datasets for action recognition.
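The abstract does not specify how the voxel set is built or how per-segment predictions are fused, so the following is only a minimal sketch of the two ideas under stated assumptions. It assumes events are rows of (x, y, t, polarity), that space-time is divided into a regular grid, and that the most populated non-empty voxels are kept. The function names, the `grid` and `max_voxels` parameters, the count-based selection rule, and the use of plain averaging as the consensus are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def events_to_voxel_set(events, width, height, grid=(8, 8, 4), max_voxels=16):
    """Group events into a sparse set of non-empty space-time voxels.

    events: array of shape (N, 4) with columns x, y, t, polarity (assumed layout).
    grid: number of bins along x, y and t (hypothetical parameter).
    max_voxels: keep only the most populated voxels (assumed selection rule).
    Returns (coords, counts): voxel indices of shape (M, 3) and events per voxel.
    """
    events = np.asarray(events, dtype=np.float64)
    if events.shape[0] == 0:
        return np.zeros((0, 3), dtype=np.int64), np.zeros((0,), dtype=np.int64)

    x, y, t = events[:, 0], events[:, 1], events[:, 2]
    gx, gy, gt = grid

    # Normalise timestamps to [0, 1] so the time axis can be binned like space.
    t_span = max(t.max() - t.min(), 1e-9)
    t_norm = (t - t.min()) / t_span

    # Integer voxel index per event, clipped so the upper edge stays in range.
    ix = np.minimum((x / width * gx).astype(np.int64), gx - 1)
    iy = np.minimum((y / height * gy).astype(np.int64), gy - 1)
    it = np.minimum((t_norm * gt).astype(np.int64), gt - 1)

    # Flatten (ix, iy, it) to a single id and count events per occupied voxel.
    flat = (it * gy + iy) * gx + ix
    ids, counts = np.unique(flat, return_counts=True)

    # Keep the densest voxels; empty voxels never appear, preserving sparsity.
    order = np.argsort(-counts, kind="stable")[:max_voxels]
    ids, counts = ids[order], counts[order]

    coords = np.stack([ids % gx, (ids // gx) % gy, ids // (gx * gy)], axis=1)
    return coords, counts


def segmental_consensus(segment_scores):
    """Fuse per-segment class scores by averaging (one simple consensus choice)."""
    return np.mean(np.asarray(segment_scores, dtype=np.float64), axis=0)
```

In this sketch, a long recording would be split into temporal segments, each segment voxelized with `events_to_voxel_set` and scored by some encoder (not shown), and the per-segment scores then combined with `segmental_consensus`. Only occupied voxels are stored, which is the property that distinguishes a voxel set from a dense frame representation.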
