Neuromorphic visual sensors are artificial retinas that output sequences of asynchronous events when brightness changes occur in the scene. These sensors offer many advantages, including very high temporal resolution, no motion blur, and smart data compression ideal for real-time processing. In this study, we introduce an event-based dataset of fine-grained manipulation actions and perform an experimental study on the use of transformers for action prediction with events. There is enormous interest in the fields of cognitive robotics and human-robot interaction in understanding and predicting human actions as early as possible. Early prediction allows anticipating complex stages for planning, enabling effective, real-time interaction. Our transformer network uses events to predict manipulation actions as they occur, using online inference. The model succeeds at predicting actions early on, building up confidence over time and achieving state-of-the-art classification. Moreover, the attention-based transformer architecture allows us to study the role of the spatio-temporal patterns selected by the model. Our experiments show that the transformer network captures the dynamic features of actions, outperforming video-based approaches and succeeding in scenarios where the differences between actions lie in very subtle cues. Finally, we release the new event dataset, which is the first in the literature for manipulation action recognition. Code will be available at https://github.com/DaniDeniz/EventVisionTransformer.