We present a framework for assistive robot manipulation that addresses two fundamental challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, especially in daily living scenarios where gathering multi-task data involving humans requires strenuous effort; and second, effectively learning robot trajectories by grounding the visual affordance model. We tackle the first challenge with a parameter-efficient prompt tuning method that prepends learnable text prompts to a frozen vision model to predict manipulation affordances in multi-task scenarios. We then propose to learn affordance-guided robot trajectories via supervised flow matching, which represents a robot visuomotor policy as a conditional process that flows random waypoints toward desired robot trajectories. Finally, we introduce a real-world dataset spanning 10 Activities of Daily Living tasks to evaluate our framework. Our extensive evaluation shows that the proposed prompt tuning method with a language prompter achieves competitive affordance prediction performance, even outperforming other fine-tuning protocols across data scales, while remaining parameter-efficient. Learning multi-task robot trajectories with a flow matching policy also yields consistently better generalization and faster inference than alternative behavior cloning methods, especially under multimodal robot action distributions. Our framework seamlessly unifies affordance model learning and trajectory generation with flow matching for robot manipulation.
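The conditional flow matching objective described above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the paper's implementation): a trajectory is a flattened vector of waypoints, the model is a stand-in linear map, and the training target is the constant velocity of a straight-line probability path from a random waypoint sample to a demonstration trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a trajectory is T waypoints in 2D, flattened;
# cond is a placeholder for an affordance conditioning feature.
T, D, cond_dim = 8, 2, 4
dim = T * D

def velocity_model(x_t, t, cond, W):
    # Stand-in for a learned visuomotor policy network:
    # a single linear map over the concatenated [x_t, t, cond].
    inp = np.concatenate([x_t, [t], cond])
    return W @ inp

def cfm_loss(x1, cond, W):
    """One-sample conditional flow matching loss.

    x0 ~ N(0, I) is a random waypoint vector; the straight-line path
    x_t = (1 - t) * x0 + t * x1 has constant target velocity x1 - x0,
    which the model regresses at a uniformly sampled time t.
    """
    x0 = rng.standard_normal(dim)
    t = rng.uniform()
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = velocity_model(x_t, t, cond, W)
    return np.mean((v_pred - v_target) ** 2)

W = np.zeros((dim, dim + 1 + cond_dim))   # untrained model weights
x1 = rng.standard_normal(dim)             # a demonstration trajectory
cond = rng.standard_normal(cond_dim)      # affordance conditioning
loss = cfm_loss(x1, cond, W)
```

At inference time, the learned velocity field would be integrated from random noise to a trajectory with a few ODE steps, which is the source of the fast inference the abstract reports relative to iterative behavior cloning baselines.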