Large-scale real-world robot data collection is a prerequisite for bringing robots into everyday deployment. However, existing pipelines often rely on specialized handheld devices to bridge the embodiment gap, which increases operator burden, limits scalability, and makes it difficult to capture the naturally coordinated perception-manipulation behaviors of everyday human interaction. This challenge calls for a more natural system that faithfully captures human manipulation and perception behaviors while enabling zero-shot transfer to robotic platforms. We introduce ActiveGlasses, a system for learning robot manipulation from egocentric human demonstrations with active vision. A stereo camera mounted on smart glasses serves as the sole perception device for both data collection and policy inference: the operator wears the glasses during bare-hand demonstrations, and the same camera is mounted on a 6-DoF perception arm during deployment to reproduce human active vision. To enable zero-shot transfer, we extract object trajectories from demonstrations and use an object-centric point-cloud policy to jointly predict manipulation and head movement. Across several challenging tasks involving occlusion and precise interaction, ActiveGlasses achieves zero-shot transfer with active vision, consistently outperforms strong baselines under the same hardware setup, and generalizes across two robot platforms.