From Vision-Language-Action (VLA) systems to robotics, existing egocentric datasets primarily focus on action recognition tasks, while largely overlooking the role of motion analysis in sports and other fast-movement scenarios. To bridge this gap, we propose a real-time motion focus recognition method that estimates the subject's locomotion intention from any egocentric video. We leverage a foundation model for camera pose estimation and introduce system-level optimizations to enable efficient and scalable inference. Evaluated on a collected egocentric action dataset, our method achieves real-time performance with bounded memory consumption through a sliding batch inference strategy. This work makes motion-centric analysis practical for edge deployment and offers a complementary perspective to existing egocentric studies of sports and fast-movement activities.
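The sliding batch inference strategy mentioned above can be illustrated with a minimal sketch: frames are processed in fixed-size batches, and only a bounded window of recent outputs is retained, so memory does not grow with video length. The function and parameter names below are illustrative assumptions, not the paper's actual implementation; the lambda stands in for a real pose-estimation model.

```python
from collections import deque
from typing import Callable, Iterable, List, Sequence

def sliding_batch_inference(
    frames: Iterable[int],
    model: Callable[[Sequence[int]], List[int]],
    batch_size: int = 4,
    history: int = 8,
) -> deque:
    """Run `model` over `frames` in fixed-size batches, keeping only the
    most recent `history` outputs so memory stays bounded (hypothetical
    sketch of a sliding batch inference loop)."""
    results: deque = deque(maxlen=history)  # bounded output buffer
    batch: List[int] = []
    for f in frames:
        batch.append(f)
        if len(batch) == batch_size:
            results.extend(model(batch))  # one batched model call
            batch = []
    if batch:  # flush any trailing partial batch
        results.extend(model(batch))
    return results

# Toy stand-in "model": returns one value per frame.
poses = sliding_batch_inference(range(20), lambda b: list(b),
                                batch_size=4, history=8)
```

With 20 frames and `history=8`, only the last 8 outputs are kept, regardless of stream length; `batch_size` trades per-call overhead against latency.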