Egocentric gesture recognition is a pivotal technology for natural human-computer interaction, yet traditional RGB-based solutions suffer from motion blur and illumination variations in dynamic scenarios. While event cameras offer high dynamic range and ultra-low power consumption, existing RGB-based architectures are inherently ill-suited to processing asynchronous event streams because of their synchronous, frame-based design. Moreover, from an egocentric perspective, event cameras record events generated by both head movements and hand gestures, which further complicates gesture recognition. To address these challenges, we propose a novel network architecture specifically designed for event data processing, incorporating (1) a lightweight CNN with asymmetric depthwise convolutions that reduces parameters while preserving spatiotemporal features, (2) a plug-and-play state-space model serving as a context block that decouples head-movement noise from gesture dynamics, and (3) a parameter-free Bins-Temporal Shift Module (BSTM) that shifts features along the bin and temporal dimensions to fuse sparse events efficiently. We further build EgoEvGesture, the first large-scale dataset for egocentric gesture recognition with event cameras. Experimental results demonstrate that our method achieves 62.7% accuracy in heterogeneous testing with only 7M parameters, 3.1% higher than state-of-the-art approaches. Notable misclassifications in freestyle motions stem from high inter-personal variability and test patterns unseen during training. Moreover, our approach achieves 96.97% accuracy on DVS128 Gesture, demonstrating strong cross-dataset generalization. The dataset and models are publicly available at https://github.com/3190105222/EgoEv_Gesture.
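To make the parameter-reduction claim for asymmetric depthwise convolutions concrete, here is a back-of-the-envelope count, not the paper's implementation: factorizing a k × k depthwise kernel into a 1 × k followed by a k × 1 depthwise convolution replaces k² weights per channel with 2k. The channel count and kernel size below are illustrative choices, not values from the paper.

```python
def square_depthwise_params(channels, k):
    """Parameters of a standard k x k depthwise convolution
    (one k x k kernel per channel, bias omitted)."""
    return channels * k * k

def asymmetric_depthwise_params(channels, k):
    """Parameters when the k x k kernel is factorized into a
    1 x k followed by a k x 1 depthwise convolution."""
    return channels * k + channels * k

print(square_depthwise_params(64, 7))      # 3136
print(asymmetric_depthwise_params(64, 7))  # 896
```

For a 7 × 7 kernel this is roughly a 3.5× reduction in depthwise weights, at the cost of restricting the kernel to a rank-1 (separable) spatial filter.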
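The parameter-free shift idea behind the BSTM can be sketched as a TSM-style zero-filled shift applied along two axes. The tensor layout (T, B, C, H, W), the shifted-group fraction, and the function name below are our assumptions for illustration; they are not the paper's exact design.

```python
import numpy as np

def bins_temporal_shift(x, group_frac=0.125):
    """Parameter-free shift along the temporal and bin axes (sketch).

    x has shape (T, B, C, H, W): temporal steps, event bins, channels,
    height, width -- a layout assumed here for illustration. Four small
    channel groups are shifted (forward/backward along T, then along B);
    vacated slots are zero-filled, TSM-style, and the remaining channels
    pass through unchanged. No learnable weights are involved.
    """
    T, B, C, H, W = x.shape
    c = max(1, int(C * group_frac))            # channels per shifted group
    out = x.copy()
    out[:, :, :4 * c] = 0.0                    # zero-fill all shifted groups
    out[1:, :, 0*c:1*c] = x[:-1, :, 0*c:1*c]   # forward shift along T
    out[:-1, :, 1*c:2*c] = x[1:, :, 1*c:2*c]   # backward shift along T
    out[:, 1:, 2*c:3*c] = x[:, :-1, 2*c:3*c]   # forward shift along B
    out[:, :-1, 3*c:4*c] = x[:, 1:, 3*c:4*c]   # backward shift along B
    return out
```

Because the shifts only move existing features between neighboring bins and time steps, the module mixes spatiotemporal context across the sparse event representation without adding any parameters or FLOP-heavy operations.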