Enabling robots to conduct and learn from their own interaction and experience is a central challenge in robotics, and it offers a scalable alternative to labor-intensive human demonstrations. However, realizing such "play" requires (1) a policy that is robust to diverse, potentially out-of-distribution environment states, and (2) a procedure that continuously produces useful robot experience. To address these challenges, we introduce Tether, a method for autonomous functional play, that is, play consisting of structured, task-directed interactions. First, we design a novel open-loop policy that warps actions from a small set of source demonstrations (≤10) by anchoring them to semantic keypoint correspondences in the target scene. We show that this design is highly data-efficient and remains robust under significant spatial and semantic variations. Second, we deploy this policy for autonomous functional play in the real world through a continuous cycle of task selection, execution, evaluation, and improvement, guided by the visual understanding capabilities of vision-language models. This procedure generates diverse, high-quality datasets with minimal human intervention. In a household-like multi-object setup, our method is the first to perform many hours of autonomous multi-task play in the real world starting from only a handful of demonstrations. The resulting stream of data consistently improves the performance of closed-loop imitation policies over time, ultimately yielding over 1000 expert-level trajectories and training policies that are competitive with those learned from human-collected demonstrations.
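To make the idea of anchoring demonstration actions to keypoint correspondences concrete, the following is a minimal sketch and not the actual Tether implementation. It assumes that each demonstration waypoint has already been assigned to one semantic keypoint, and that the matching keypoint has been located in the target scene. The function name `warp_waypoints`, the `anchor_ids` argument, and the pure-translation warp are all illustrative assumptions; the real method may use a richer transformation.

```python
import numpy as np


def warp_waypoints(waypoints, anchor_ids, src_keypoints, tgt_keypoints):
    """Shift each demo waypoint by the displacement of its anchor keypoint.

    Hypothetical helper (not from the paper).
    waypoints:     (T, 3) end-effector positions from a source demonstration.
    anchor_ids:    (T,) index of the keypoint each waypoint is anchored to.
    src_keypoints: (K, 3) keypoint positions in the source scene.
    tgt_keypoints: (K, 3) corresponding keypoint positions in the target scene.
    """
    waypoints = np.asarray(waypoints, dtype=float)
    src = np.asarray(src_keypoints, dtype=float)
    tgt = np.asarray(tgt_keypoints, dtype=float)
    anchor_ids = np.asarray(anchor_ids, dtype=int)

    # Displacement of every keypoint between the source and target scenes.
    offsets = tgt - src
    # Each waypoint moves rigidly with its own anchor keypoint.
    return waypoints + offsets[anchor_ids]


# Toy example: a grasp waypoint anchored to keypoint 0 (e.g. a handle)
# and a place waypoint anchored to keypoint 1 (e.g. a shelf).
demo = [[0.1, 0.0, 0.2], [0.5, 0.5, 0.2]]
src_kp = [[0.1, 0.0, 0.0], [0.5, 0.5, 0.0]]
tgt_kp = [[0.2, 0.1, 0.0], [0.4, 0.6, 0.0]]
warped = warp_waypoints(demo, [0, 1], src_kp, tgt_kp)
```

In this toy example the grasp waypoint follows the handle to its new location and the place waypoint follows the shelf, while each keeps its original height offset above its anchor. Because the warped trajectory is executed without feedback, this corresponds to an open-loop policy in the sense used above.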