Egocentric gesture recognition is a pivotal technology for natural human-computer interaction, yet traditional RGB-based solutions suffer from motion blur and illumination variations in dynamic scenarios. While event cameras offer high dynamic range and ultra-low power consumption, existing RGB-based architectures are inherently ill-suited to processing asynchronous event streams because of their synchronous, frame-based design. Moreover, from an egocentric perspective, event cameras record events generated by both head movements and hand gestures, which further complicates gesture recognition. To address these challenges, we propose a novel network architecture specifically designed for event data processing, incorporating (1) a lightweight CNN with asymmetric depthwise convolutions that reduces parameters while preserving spatiotemporal features, (2) a plug-and-play state-space model serving as a context block that decouples head-movement noise from gesture dynamics, and (3) a parameter-free Bins-Temporal Shift Module (BSTM) that shifts features along the bin and temporal dimensions to fuse sparse events efficiently. We further build EgoEvGesture, the first large-scale dataset for egocentric gesture recognition with event cameras. Experimental results demonstrate that our method achieves 62.7% accuracy in heterogeneous testing with only 7M parameters, 3.1% higher than state-of-the-art approaches. Notable misclassifications in freestyle motions stem from high inter-personal variability and test patterns unseen during training. Moreover, our approach achieves 96.97% accuracy on DVS128 Gesture, demonstrating strong cross-dataset generalization. The dataset and models are publicly available at https://github.com/3190105222/EgoEv_Gesture.
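To make the parameter-reduction claim for asymmetric depthwise convolutions concrete, here is a back-of-the-envelope count, not the paper's implementation: factorizing a k × k depthwise kernel into a 1 × k followed by a k × 1 depthwise convolution replaces k² weights per channel with 2k. The channel count and kernel size below are illustrative choices, not values from the paper.

```python
def square_depthwise_params(channels, k):
    """Parameters of a standard k x k depthwise convolution
    (one k x k kernel per channel, bias omitted)."""
    return channels * k * k

def asymmetric_depthwise_params(channels, k):
    """Parameters when the k x k kernel is factorized into a
    1 x k followed by a k x 1 depthwise convolution."""
    return channels * k + channels * k

print(square_depthwise_params(64, 7))      # 3136
print(asymmetric_depthwise_params(64, 7))  # 896
```

For a 7 × 7 kernel this is roughly a 3.5× reduction in depthwise weights, at the cost of restricting the kernel to a rank-1 (separable) spatial filter.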
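The parameter-free shift idea behind the BSTM can be sketched as a TSM-style zero-filled shift applied along two axes. The tensor layout (T, B, C, H, W), the shifted-group fraction, and the function name below are our assumptions for illustration; they are not the paper's exact design.

```python
import numpy as np

def bins_temporal_shift(x, group_frac=0.125):
    """Parameter-free shift along the temporal and bin axes (sketch).

    x has shape (T, B, C, H, W): temporal steps, event bins, channels,
    height, width -- a layout assumed here for illustration. Four small
    channel groups are shifted (forward/backward along T, then along B);
    vacated slots are zero-filled, TSM-style, and the remaining channels
    pass through unchanged. No learnable weights are involved.
    """
    T, B, C, H, W = x.shape
    c = max(1, int(C * group_frac))            # channels per shifted group
    out = x.copy()
    out[:, :, :4 * c] = 0.0                    # zero-fill all shifted groups
    out[1:, :, 0*c:1*c] = x[:-1, :, 0*c:1*c]   # forward shift along T
    out[:-1, :, 1*c:2*c] = x[1:, :, 1*c:2*c]   # backward shift along T
    out[:, 1:, 2*c:3*c] = x[:, :-1, 2*c:3*c]   # forward shift along B
    out[:, :-1, 3*c:4*c] = x[:, 1:, 3*c:4*c]   # backward shift along B
    return out
```

Because the shifts only move existing features between neighboring bins and time steps, the module mixes spatiotemporal context across the sparse event representation without adding any parameters or FLOP-heavy operations.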