Human videos contain rich manipulation priors, but using them for robot learning remains difficult because raw observations entangle scene understanding, human motion, and embodiment-specific action. We introduce MoT-HRA, a hierarchical vision-language-action framework that learns human-intention priors from large-scale human demonstrations. We first curate HA-2.2M, a 2.2M-episode action-language dataset reconstructed from heterogeneous human videos through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. On top of this dataset, MoT-HRA factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks. A shared-attention trunk and read-only key-value transfer allow downstream control to use human priors while limiting interference with upstream representations. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks show that MoT-HRA improves motion plausibility and control robustness under distribution shift.
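The abstract does not specify how read-only key-value transfer is implemented. As one possible reading, the sketch below lays out the tokens of the three experts in upstream-to-downstream order and builds an attention mask in which each token may read keys and values from its own expert and from upstream experts, but never from downstream ones. The expert names, token counts, and the functions `build_read_only_mask` and `masked_attention` are illustrative assumptions, not the authors' code; any gradient blocking on the transferred keys and values is not shown.

```python
import math

# Hypothetical expert order, upstream to downstream (names are illustrative).
EXPERTS = ["vision_language", "intention", "fine"]


def build_read_only_mask(token_counts):
    """Return mask[q][k] = True if query token q may attend to key token k.

    Tokens are laid out expert by expert. A query may read keys from its own
    expert and from any upstream expert, never from a downstream one.
    """
    owner = [i for i, n in enumerate(token_counts) for _ in range(n)]
    size = len(owner)
    return [[owner[k] <= owner[q] for k in range(size)] for q in range(size)]


def masked_attention(queries, keys, values, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    dim = len(queries[0])
    outputs = []
    for q_idx, q in enumerate(queries):
        # Masked positions get a score of -inf, so their softmax weight is 0.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            if mask[q_idx][k_idx]
            else -math.inf
            for k_idx, k in enumerate(keys)
        ]
        peak = max(scores)  # finite: every token can attend to itself
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(
            [
                sum(w / total * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return outputs


# Two vision-language tokens, one intention token, one fine (action) token.
mask = build_read_only_mask([2, 1, 1])
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
mixed = masked_attention(tokens, tokens, tokens, mask)
```

Under this mask, changing the fine expert's token leaves the outputs of the vision-language and intention tokens unchanged in the forward pass, which is the "limited interference" property the abstract describes. The fine expert's token still reads every upstream key and value.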