The prevalent paradigm in robot learning attempts to generalize across environments, embodiments, and tasks with language prompts at runtime. A fundamental tension limits this approach: language is often too abstract to guide the concrete physical understanding required for robust manipulation. In this work, we introduce Contact-Anchored Policies (CAP), which replace language conditioning with points of physical contact in space. Simultaneously, we structure CAP as a library of modular utility models rather than a monolithic generalist policy. This factorization allows us to implement a real-to-sim iteration cycle: we build EgoGym, a lightweight simulation benchmark, to rapidly identify failure modes and refine our models and datasets prior to real-world deployment. We show that by conditioning on contact and iterating via simulation, CAP generalizes to novel environments and embodiments out of the box on three fundamental manipulation skills while using only 23 hours of demonstration data, and outperforms large, state-of-the-art vision-language-action models (VLAs) in zero-shot evaluations by 56%. All model checkpoints, codebase, hardware, simulation, and datasets will be open-sourced. Project page: https://cap-policy.github.io/