We present TiPToP, an extensible modular system that combines pretrained vision foundation models with an existing Task and Motion Planner (TAMP) to solve multi-step manipulation tasks directly from input RGB images and natural-language instructions. Our system aims to be simple and easy to use: it can be installed and run on a standard DROID setup in under one hour and adapted to new embodiments with minimal effort. We evaluate TiPToP, which requires zero robot data, on 28 tabletop manipulation tasks in simulation and the real world and find that it matches or outperforms $\pi_{0.5}$-DROID, a vision-language-action (VLA) model fine-tuned on 350 hours of embodiment-specific demonstrations. TiPToP's modular architecture enables us to analyze the system's failure modes at the component level. We analyze results from an evaluation of 173 trials and identify directions for improvement. We release TiPToP as open source to further research on modular manipulation systems and tighter integration between learning and planning. Project website and code: https://tiptop-robot.github.io