Embodied foundation models have achieved significant breakthroughs in robotic manipulation, yet they still depend heavily on large-scale robot demonstrations. Although recent work has explored leveraging human data to reduce this dependency, extracting transferable knowledge remains difficult because of the inherent embodiment gap between humans and robots. We argue that the intention underlying human actions can serve as a powerful intermediate representation for bridging this gap. In this paper, we introduce a novel framework that explicitly learns and transfers human intention to facilitate robotic manipulation. Specifically, we model intention through gaze, which naturally precedes physical action and serves as an observable proxy for human intent. Our model is first pretrained on a large-scale egocentric human dataset to capture human intention and its synergy with action, and is then finetuned on a small set of robot and human data. During inference, the model follows a Chain-of-Thought reasoning paradigm, predicting intention first and then the action. Extensive evaluations in simulation and real-world settings, covering long-horizon and fine-grained tasks as well as few-shot and robustness benchmarks, show that our method consistently outperforms strong baselines, generalizes better, and achieves state-of-the-art performance.