Learning from human video demonstrations remains challenging because of noisy hand-object interactions, partially observed unseen objects, and cross-embodiment discrepancies. To address these challenges, we present \textit{HOWTransfer} (\emph{H}and-\emph{O}bject \emph{O}pen-\emph{W}orld Transfer), a hand-centric framework that distills human demonstrations into contact-aware, taxonomy-informed, and diverse robot trajectories. Instead of relying on object-specific descriptions, vision-language queries, or explicit object-state tracking, \emph{HOWTransfer} recovers temporally consistent 3D hand motion and localizes contact intervals in time by reasoning over observed hand-object interaction cues. The localized contact onsets are then used to retarget human grasp intent into multi-modal parallel-jaw grasp hypotheses, which are propagated along the recovered wrist trajectory to generate robot-executable motions. Finally, a trajectory editing stage refines contact alignment and produces diverse executable variants from a single demonstration. Experiments across diverse manipulation tasks show that \emph{HOWTransfer} achieves accurate contact localization and high-quality robot motion retargeting with an $86\%$ success rate, and that the resulting trajectories are preferred over teleoperated trajectories in a blinded preference study.
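As an illustrative sketch only, one plausible reading of the propagation step can be written as a rigid transform composition. The notation here is an assumption and does not come from the abstract: $T^{w}_{t} \in SE(3)$ is the recovered wrist pose at frame $t$, $t_c$ is a localized contact onset, and $G^{(k)}_{t_c} \in SE(3)$ is the $k$-th parallel-jaw grasp hypothesis at that onset, all expressed in a common world frame. Assuming the gripper moves rigidly with the wrist after contact,
\begin{equation}
G^{(k)}_{t} \;=\; T^{w}_{t}\,\bigl(T^{w}_{t_c}\bigr)^{-1}\,G^{(k)}_{t_c}, \qquad t \ge t_c ,
\end{equation}
so that $G^{(k)}_{t_c}$ is recovered at $t = t_c$ and each hypothesis keeps a fixed offset $\bigl(T^{w}_{t_c}\bigr)^{-1} G^{(k)}_{t_c}$ relative to the wrist. The actual method may instead use a learned or optimized mapping, and the trajectory editing stage would further adjust these poses.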