Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. Vision Language Models (VLMs) are essential to Vision-Language-Action (VLA) systems, but their reliance on third-person training data creates a viewpoint gap for humanoid robots. Collecting massive robot-centric data would be an ideal solution, but it is impractical due to cost and diversity constraints. Conversely, human egocentric videos offer a highly scalable data source with rich interaction context, yet the embodiment mismatch prevents their direct application. To bridge this gap, we propose an Egocentric2Embodiment Translation Pipeline that transforms raw human egocentric videos into multi-level, schema-driven embodiment supervision with enforced evidence grounding and temporal consistency, enabling construction of the Egocentric2Embodiment dataset (E2E-3M) at scale. Training on E2E-3M yields an egocentric-aware embodied brain, termed PhysBrain. PhysBrain exhibits substantially improved egocentric understanding, particularly for planning, and provides an egocentric-aware initialization that enables more sample-efficient VLA fine-tuning and higher success rates, demonstrating effective transfer from human egocentric supervision to downstream robot control.