We present a novel approach for hand-object action recognition that leverages 2D point tracks as an additional motion cue. While most existing methods rely on RGB appearance, human pose estimation, or their combination, our work demonstrates that tracking randomly sampled image points across video frames can substantially improve recognition accuracy. Unlike prior approaches, we do not detect hands, objects, or interaction regions. Instead, we employ CoTracker to follow a set of randomly initialized points through each video and use the resulting trajectories, together with the corresponding image frames, as input to a Transformer-based recognition model. Surprisingly, our method achieves notable gains even when only the initial frame and the point tracks are provided, without incorporating the full video sequence. Experimental results confirm that integrating 2D point tracks consistently enhances performance compared to the same model trained without motion information, highlighting their potential as a lightweight yet effective representation for hand-object action understanding.
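The pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the tracker is a stand-in stub (each point simply stays at its initial location) where a real pipeline would invoke CoTracker on the video. Only the random query sampling and the resulting tensor layout reflect the method as described.

```python
import numpy as np

def sample_random_queries(num_points, height, width, seed=0):
    """Sample random (x, y) query points in the first frame,
    mirroring the random point initialization described above."""
    rng = np.random.default_rng(seed)
    xs = rng.uniform(0, width, size=num_points)
    ys = rng.uniform(0, height, size=num_points)
    return np.stack([xs, ys], axis=1)  # shape (N, 2)

def track_points_stub(video, queries):
    """Placeholder for the point tracker: every point keeps its
    initial position in all frames. A real pipeline would run
    CoTracker here to obtain true trajectories and visibility."""
    num_frames = video.shape[0]
    tracks = np.repeat(queries[None, :, :], num_frames, axis=0)  # (T, N, 2)
    visibility = np.ones((num_frames, queries.shape[0]), dtype=bool)  # (T, N)
    return tracks, visibility

# Toy video: 8 frames of 64x64 RGB.
video = np.zeros((8, 64, 64, 3), dtype=np.uint8)
queries = sample_random_queries(num_points=16, height=64, width=64)
tracks, visibility = track_points_stub(video, queries)

# Per the abstract, even the first frame plus the tracks alone
# form a useful input to the Transformer-based recognizer.
model_input = {
    "first_frame": video[0],       # (H, W, 3)
    "tracks": tracks,              # (T, N, 2)
    "visibility": visibility,      # (T, N)
}
```

The key point the sketch conveys is the shape of the motion representation: a `(T, N, 2)` trajectory tensor over `N` randomly placed points, which is lightweight compared to full video features.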