Humans learn locomotion through visual observation, interpreting visual content before imitating actions. However, state-of-the-art humanoid locomotion systems rely on either curated motion-capture trajectories or sparse text commands, leaving a critical gap between visual understanding and control. Text-to-motion methods suffer from semantic sparsity and staged pipeline errors, while video-based approaches perform only mechanical pose mimicry without genuine visual understanding. We propose RoboMirror, the first retargeting-free video-to-locomotion framework embodying "understand before you imitate". Leveraging vision-language models (VLMs), it distills raw egocentric and third-person videos into visual motion intents, which directly condition a diffusion-based policy to generate physically plausible, semantically aligned locomotion without explicit pose reconstruction or retargeting. Extensive experiments validate the effectiveness of RoboMirror: it enables telepresence from egocentric videos, reduces third-person control latency by 80%, and achieves a 3.7% higher task success rate than baselines. By reframing humanoid control around video understanding, we bridge the gap between visual understanding and action.
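For concreteness, the pipeline sketched in the abstract (video features distilled into a visual motion intent that conditions a diffusion-based action denoiser) can be illustrated as follows. This is a minimal sketch under assumed interfaces: the module names (VLMIntentEncoder, DiffusionLocomotionPolicy), dimensions, and the toy denoiser are illustrative assumptions, not the RoboMirror implementation.

```python
# Minimal illustrative sketch of "understand before you imitate":
# video features -> visual motion intent -> intent-conditioned diffusion policy.
# All names, dimensions, and architectures below are assumptions for illustration.
import torch
import torch.nn as nn


class VLMIntentEncoder(nn.Module):
    """Stand-in for a VLM that distills video frame features into a motion intent."""

    def __init__(self, frame_dim: int = 512, intent_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(frame_dim, intent_dim), nn.ReLU(),
            nn.Linear(intent_dim, intent_dim))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, frame_dim) -> pooled intent: (batch, intent_dim)
        return self.proj(frames).mean(dim=1)


class DiffusionLocomotionPolicy(nn.Module):
    """Toy denoiser that predicts noise on an action chunk, conditioned on the intent."""

    def __init__(self, action_dim: int = 29, horizon: int = 16, intent_dim: int = 256):
        super().__init__()
        self.action_dim, self.horizon = action_dim, horizon
        self.net = nn.Sequential(
            nn.Linear(action_dim * horizon + intent_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, action_dim * horizon))

    def forward(self, noisy_actions, intent, t):
        # noisy_actions: (batch, horizon, action_dim); t: (batch, 1) diffusion step
        x = torch.cat([noisy_actions.flatten(1), intent, t], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)


if __name__ == "__main__":
    encoder, policy = VLMIntentEncoder(), DiffusionLocomotionPolicy()
    frames = torch.randn(2, 8, 512)        # batch of 2 clips, 8 frame features each
    intent = encoder(frames)               # visual motion intent, no pose retargeting
    noisy = torch.randn(2, 16, 29)         # noised action chunk
    t = torch.rand(2, 1)                   # diffusion timestep
    print(policy(noisy, intent, t).shape)  # torch.Size([2, 16, 29])
```

In a real system the intent encoder would be a pretrained VLM and the policy a full diffusion denoiser with a noise schedule; the sketch only shows how a video-derived intent can condition action generation directly, without an intermediate pose reconstruction or retargeting stage.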