EventTransAct: A video transformer-based framework for Event-camera based action recognition

Recognizing and comprehending human actions and gestures is a crucial perception requirement for robots to interact with humans and carry out tasks in diverse domains, including service robotics, healthcare, and manufacturing. Event cameras, which capture fast-moving objects at a high temporal resolution, offer new opportunities beyond standard action recognition in RGB videos. However, previous research on event-camera action recognition has primarily focused on sensor-specific network architectures and image encodings, which may not be suitable for new sensors and which limit the use of recent advances in transformer-based architectures. In this study, we employ a computationally efficient model, the video transformer network (VTN), which first acquires a spatial embedding per event frame and then applies a temporal self-attention mechanism. To better adapt the VTN to the sparse and fine-grained nature of event data, we design an Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations. The proposed $\mathcal{L}_{EC}$ promotes learning fine-grained spatial cues in the spatial backbone of the VTN by contrasting temporally misaligned frames. We evaluate our method on real-world action recognition with the N-EPIC Kitchens dataset and achieve state-of-the-art results on both protocols: testing in a seen kitchen (\textbf{74.9\%} accuracy) and testing in unseen kitchens (\textbf{42.43\% and 46.66\% accuracy}). Our approach also requires less computation time than competitive prior approaches, which demonstrates the potential of our framework, \textit{EventTransAct}, for real-world applications of event-camera based action recognition. Project Page: \url{https://tristandb8.github.io/EventTransAct_webpage/}
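
The idea of contrasting temporally misaligned frames can be illustrated with a minimal sketch. The abstract does not give the exact form of $\mathcal{L}_{EC}$, so the code below assumes a generic InfoNCE-style objective: two augmented views of the same event frame form the positive pair, and frames from other timestamps of the same clip act as negatives. The function names, the cosine similarity, and the temperature value are hypothetical choices made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def event_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one frame embedding (illustrative assumption).

    anchor, positive: embeddings of two views of the same event frame
                      (temporally aligned).
    negatives:        embeddings of frames from other timestamps of the
                      same clip (temporally misaligned).
    """
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Negative log-probability of picking the aligned frame.
    return log_denominator - pos_logit


# Toy 2-D embeddings: the aligned view is close to the anchor,
# the misaligned frames point elsewhere.
anchor = [1.0, 0.0]
aligned_view = [0.9, 0.1]
misaligned_frames = [[0.0, 1.0], [-1.0, 0.0]]

loss_when_aligned = event_contrastive_loss(anchor, aligned_view, misaligned_frames)
loss_when_swapped = event_contrastive_loss(
    anchor, misaligned_frames[0], [aligned_view, misaligned_frames[1]]
)
```

In this toy setting the loss is small when the aligned view is the most similar embedding and grows when a misaligned frame is treated as the positive, which is the pressure that would push a spatial backbone to separate frames that differ only in fine-grained, time-dependent detail.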
