Neuromorphic event cameras possess superior temporal resolution, power efficiency, and dynamic range compared to traditional cameras. However, their asynchronous and sparse data format poses a significant challenge for conventional deep learning methods. Most existing methods either densify events into frames, sacrificing their sparse asynchronous nature, or use irregular models that are less compatible with GPU acceleration. Inspired by word-to-vector models, we propose event2vec, a novel representation that allows Transformers to process events directly. We demonstrate the effectiveness of event2vec on the DVS Gesture, ASL-DVS, and DVS-Lip benchmarks, showing that event2vec is remarkably parameter-efficient, features high throughput and low latency, and achieves high accuracy even with an extremely low number of events or low spatial resolutions. These results show that sparse asynchronous event data can be directly integrated into high-throughput Transformer architectures, offering an efficient paradigm for real-time neuromorphic vision. The code is provided at https://github.com/Intelligent-Computing-Lab-Panda/event2vec.
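To make the word-to-vector analogy concrete, the following is a minimal sketch of how a sparse event stream could be turned into a sequence of token vectors that a Transformer accepts. It is an illustration under stated assumptions, not the authors' implementation: the event layout `(x, y, t, p)`, the lookup table indexed by polarity and pixel position, the sinusoidal timestamp encoding, and all function names are hypothetical choices made for this example. The linked repository contains the actual event2vec formulation.

```python
import math
import random


def make_embedding_table(height, width, dim, seed=0):
    """Randomly initialised lookup table with one vector per (polarity, y, x).

    Assumption for this sketch: each pixel/polarity pair is treated like a
    'word' in a vocabulary of size 2 * height * width.
    """
    rng = random.Random(seed)
    vocab_size = 2 * height * width
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]


def time_encoding(t, dim, max_period=10000.0):
    """Sinusoidal encoding of a timestamp (an assumed choice, not from the paper)."""
    enc = []
    for i in range(dim):
        freq = 1.0 / (max_period ** (2 * (i // 2) / dim))
        enc.append(math.sin(t * freq) if i % 2 == 0 else math.cos(t * freq))
    return enc


def embed_events(events, table, height, width):
    """Map each event (x, y, t, p) to one token vector, with no dense frame built."""
    dim = len(table[0])
    tokens = []
    for x, y, t, p in events:
        index = (p * height + y) * width + x  # flatten (p, y, x) to a table row
        spatial = table[index]
        temporal = time_encoding(t, dim)
        tokens.append([s + e for s, e in zip(spatial, temporal)])
    return tokens


# Toy usage: three events on a 4x4 sensor, embedded into 8-dimensional tokens.
HEIGHT, WIDTH, DIM = 4, 4, 8
table = make_embedding_table(HEIGHT, WIDTH, DIM)
events = [(0, 1, 0.0, 1), (3, 2, 0.5, 0), (0, 1, 1.2, 1)]
tokens = embed_events(events, table, HEIGHT, WIDTH)
print(len(tokens), len(tokens[0]))
```

In this sketch the sequence length equals the number of events, so the cost scales with event count rather than sensor resolution, and the resulting regular tensor of tokens is the kind of input that batches well on a GPU.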