From Vision-Language-Action (VLA) systems to robotics, existing egocentric datasets focus primarily on action recognition and largely overlook the role of motion analysis in sports and other fast-movement scenarios. To bridge this gap, we propose a real-time motion-focus recognition method that estimates the subject's locomotion intention from arbitrary egocentric video. Our approach leverages a foundation model for camera pose estimation and introduces system-level optimizations for efficient, scalable inference. Evaluated on a collected egocentric action dataset, our method achieves real-time performance with manageable memory consumption through a sliding batch inference strategy. This work makes motion-centric analysis practical for edge deployment and offers a complementary perspective to existing egocentric studies of sports and fast-movement activities.
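The sliding batch inference strategy mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `sliding_batch_inference`, the `window`/`stride` parameters, and the `model` callable (which is assumed to map a batch of frames to one pose estimate per frame) are all hypothetical. The idea is that peak memory is bounded by the window size, while overlapping windows give the pose estimator temporal context.

```python
def sliding_batch_inference(frames, model, window=16, stride=8):
    """Run a per-frame estimator over a frame sequence in overlapping
    batches so that peak memory is bounded by `window`, not by the
    full video length. `model` is a hypothetical callable returning
    one estimate per input frame."""
    results = []
    for start in range(0, max(len(frames) - window + 1, 1), stride):
        batch = frames[start:start + window]
        poses = model(batch)
        # The first window contributes all of its estimates; later
        # windows contribute only the non-overlapping tail, so each
        # frame is emitted exactly once.
        results.extend(poses if start == 0 else poses[window - stride:])
    # Trailing frames that do not fill a new stride are left for the
    # next call; a real streaming loop would flush them explicitly.
    return results
```

With `window=16` and `stride=8`, each inference step holds at most 16 frames in memory regardless of video length, which is the property that makes the approach viable on edge devices.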