We present a novel approach to egocentric action recognition that leverages 2D point tracks as an additional motion cue. While most existing methods rely on RGB appearance, human pose estimation, or a combination of the two, we demonstrate that tracking randomly sampled image points across video frames can substantially improve recognition accuracy. Unlike prior approaches, we do not detect hands, objects, or interaction regions. Instead, we employ CoTracker to follow a set of randomly initialized points through each video and feed the resulting trajectories, together with the corresponding image frames, into a Transformer-based recognition model. Surprisingly, our method achieves notable gains even when only the initial frame and its associated point tracks are provided, without the full video sequence. Experimental results confirm that integrating 2D point tracks consistently improves performance over the same model trained without motion information, highlighting their potential as a lightweight yet effective representation for egocentric action understanding.
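
To make the tracking step concrete, the following is a minimal sketch (not the authors' released code) of extracting 2D point tracks for randomly initialized query points with CoTracker loaded via torch.hub. The `cotracker3_offline` hub entry point, the helper name `extract_random_point_tracks`, and the default of 256 points are illustrative assumptions; the abstract does not specify the CoTracker variant or the number of tracked points.

```python
import torch

def extract_random_point_tracks(video: torch.Tensor, num_points: int = 256):
    """video: float tensor of shape (B, T, C, H, W) with pixel values in [0, 255]."""
    B, T, C, H, W = video.shape
    device = video.device

    # Randomly initialize query points on the first frame; each query is (t, x, y).
    scale = torch.tensor([W, H], dtype=torch.float32, device=device)
    xy = torch.rand(B, num_points, 2, device=device) * scale
    t0 = torch.zeros(B, num_points, 1, device=device)
    queries = torch.cat([t0, xy], dim=-1)  # (B, N, 3)

    # Load CoTracker from torch.hub (assumed "cotracker3_offline" entry point;
    # weights are downloaded on first use).
    cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline").to(device)

    # Track all query points through the whole clip.
    with torch.no_grad():
        tracks, visibility = cotracker(video, queries=queries)  # tracks: (B, T, N, 2)
    return tracks, visibility
```

In the described method, the resulting trajectories, together with the corresponding frames (or, in the reduced setting, only the initial frame), are then fed to the Transformer-based recognition model; that component is specific to the paper and is not sketched here.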