Prior work on 3D hand trajectory prediction is constrained by datasets that decouple motion from semantic supervision and by models that only weakly link reasoning and action. To address these limitations, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction comprising 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation through a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach produces accurate, stage-aware trajectories and generalizes across real-world scenes.