Event cameras are neuromorphic vision sensors that represent visual information as sparse, asynchronous event streams. Most state-of-the-art event-based methods project events into dense frames and process them with conventional learning models. However, these approaches sacrifice the sparsity and high temporal resolution of event data, resulting in large model sizes and high computational complexity. To fit the sparse nature of events and fully exploit the relationships among them, we develop a novel attention-aware model named Event Voxel Set Transformer (EVSTr) for spatiotemporal representation learning on event streams. It first converts the event stream into voxel sets and then hierarchically aggregates voxel features to obtain robust representations. The core of EVSTr is an event voxel transformer encoder that extracts discriminative spatiotemporal features. It consists of two components: a Multi-Scale Neighbor Embedding Layer (MNEL) for local information aggregation and a Voxel Self-Attention Layer (VSAL) for global feature interactions. To enable the network to capture long-range temporal structure, we introduce a segment modeling strategy that learns motion patterns from a sequence of segmented voxel sets. We evaluate the proposed model on two event-based recognition tasks: object classification and action recognition. Comprehensive experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity. Additionally, we present a new dataset (NeuroHAR), recorded in challenging visual scenarios, to address the lack of real-world event-based datasets for action recognition.
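To make the first step of the pipeline concrete (converting an event stream into a voxel set), here is a minimal sketch of one way such a conversion could look. It is not the authors' implementation: the function name `events_to_voxel_set`, the `(x, y, timestamp, polarity)` event layout, the voxel size, the number of temporal bins, the per-voxel polarity counts, and the rule of keeping the most populated voxels are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed event layout: (x, y, timestamp, polarity), with polarity in {-1, +1}.
Event = Tuple[int, int, float, int]
# A voxel is indexed by (x_bin, y_bin, t_bin).
VoxelKey = Tuple[int, int, int]


def events_to_voxel_set(
    events: List[Event],
    voxel_size: Tuple[int, int] = (4, 4),  # hypothetical spatial voxel size in pixels
    num_time_bins: int = 4,                # hypothetical number of temporal bins
    max_voxels: int = 512,                 # hypothetical cap on the size of the set
) -> Dict[VoxelKey, Tuple[int, int]]:
    """Group events into a sparse set of non-empty voxels.

    Returns a mapping from voxel index to (positive_count, negative_count).
    Empty voxels are never stored, so the representation stays sparse.
    """
    if not events:
        return {}

    t_start = min(e[2] for e in events)
    t_end = max(e[2] for e in events)
    # Fall back to 1.0 when all events share one timestamp (avoids dividing by zero).
    duration = (t_end - t_start) or 1.0

    counts = defaultdict(lambda: [0, 0])
    for x, y, t, p in events:
        # Normalise the timestamp to [0, 1], then clamp the last event into the final bin.
        t_bin = min(int((t - t_start) / duration * num_time_bins), num_time_bins - 1)
        key = (x // voxel_size[0], y // voxel_size[1], t_bin)
        counts[key][0 if p > 0 else 1] += 1

    # Assumption: when there are too many voxels, keep the ones holding the most events.
    densest = sorted(counts.items(), key=lambda kv: -(kv[1][0] + kv[1][1]))[:max_voxels]
    return {key: (pos, neg) for key, (pos, neg) in densest}


# Usage: three synthetic events fall into two non-empty voxels.
events = [(0, 0, 0.0, 1), (1, 2, 0.1, -1), (9, 9, 1.0, 1)]
voxels = events_to_voxel_set(events, voxel_size=(4, 4), num_time_bins=2)
```

Only occupied voxels are stored, so the cost of later stages scales with the number of active voxels rather than with the sensor resolution, which is the motivation the abstract gives for avoiding dense frames. A feature encoder such as the one described above would then operate on this set.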