Embodied foundation models have achieved significant breakthroughs in robotic manipulation, yet they still depend heavily on large-scale robot demonstrations. Although recent works have explored leveraging human data to alleviate this dependency, effectively extracting transferable knowledge remains a major challenge due to the inherent embodiment gap between humans and robots. We argue that the intention underlying human actions can serve as a powerful intermediate representation for bridging this gap. In this paper, we introduce a novel framework that explicitly learns and transfers human intention to facilitate robotic manipulation. Specifically, we model intention through gaze, as it naturally precedes physical actions and serves as an observable proxy for human intent. Our model is first pretrained on a large-scale egocentric human dataset to capture human intention and its synergy with action, followed by finetuning on a small amount of robot and human data. During inference, the model adopts a Chain-of-Thought reasoning paradigm, sequentially predicting intention before executing the action. Extensive evaluations in simulation and real-world settings, across long-horizon and fine-grained tasks, and under few-shot and robustness benchmarks, show that our method consistently outperforms strong baselines, generalizes better, and achieves state-of-the-art performance. Project page: https://gazevla.github.io.
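To make the Chain-of-Thought inference paradigm concrete, here is a minimal conceptual sketch in Python of the two-stage ordering the abstract describes: the policy first predicts an intention (a 2D gaze point) from the observation, then predicts the action conditioned on that intention. This is not the authors' implementation; all names (`GazeCoTPolicy`, `predict_gaze`, `predict_action`) and the 7-dimensional action format are hypothetical placeholders.

```python
# Conceptual sketch of gaze-as-intention Chain-of-Thought inference.
# Placeholder logic only; a real model would be a learned VLA policy.
from dataclasses import dataclass
from typing import Tuple

import numpy as np


@dataclass
class Observation:
    rgb: np.ndarray    # egocentric camera image, H x W x 3
    instruction: str   # language command, e.g. "pick up the mug"


class GazeCoTPolicy:
    """Toy stand-in for a gaze-conditioned manipulation policy."""

    def predict_gaze(self, obs: Observation) -> Tuple[int, int]:
        # Stage 1 (intention): a real model would regress a gaze point
        # from the image and instruction; here we return the image center.
        h, w, _ = obs.rgb.shape
        return h // 2, w // 2

    def predict_action(self, obs: Observation, gaze: Tuple[int, int]) -> np.ndarray:
        # Stage 2 (action): a real model would decode an action
        # conditioned on both the observation and the predicted gaze.
        return np.zeros(7)  # hypothetical: 6-DoF end-effector delta + gripper

    def act(self, obs: Observation) -> np.ndarray:
        # Chain-of-Thought ordering: intention (gaze) first, action second.
        gaze = self.predict_gaze(obs)
        return self.predict_action(obs, gaze)


if __name__ == "__main__":
    policy = GazeCoTPolicy()
    obs = Observation(rgb=np.zeros((224, 224, 3), dtype=np.uint8),
                      instruction="pick up the mug")
    print(policy.act(obs))
```

The point of the sketch is purely the control flow: the action head never runs without an explicit intention prediction preceding it, mirroring how gaze precedes physical action in the human data the model is pretrained on.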