Recent work in Task and Motion Planning (TAMP) shows that training control policies on language-supervised robot trajectories with high-quality labels markedly improves agent task success rates. However, the scarcity of such data is a significant hurdle to extending these methods to general use cases. To address this, we present an automated framework that decomposes trajectory data into temporally bounded sub-tasks described in natural language, leveraging recent prompting strategies for Foundation Models (FMs), including both Large Language Models (LLMs) and Vision Language Models (VLMs). For each lower-level sub-task that makes up a full trajectory, the framework provides both a time interval and a language description. To rigorously evaluate the quality of this automatic labeling, we contribute an algorithm, SIMILARITY, that produces two novel metrics: temporal similarity and semantic similarity. These metrics measure, respectively, the temporal alignment and the semantic fidelity of the language descriptions between two sub-task decompositions, namely an FM-predicted decomposition and a ground-truth decomposition. Across multiple robotic environments, we report temporal similarity and semantic similarity scores above 90%, compared to 30% for a randomized baseline, demonstrating the effectiveness of the proposed framework. Our results enable building diverse, large-scale, language-supervised datasets for improved robotic TAMP.
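The abstract does not define how SIMILARITY computes its two metrics, so the following is only an illustrative sketch of what comparing two sub-task decompositions could look like, not the authors' algorithm. It assumes that each sub-task is a time interval with a text description, that temporal alignment can be approximated by interval intersection-over-union (IoU), and that semantic fidelity can be approximated by word-level Jaccard overlap (a crude stand-in for an embedding-based comparison). The names `SubTask`, `interval_iou`, `token_jaccard`, and `similarity_sketch` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubTask:
    start: float  # start time (seconds or step index)
    end: float    # end time, assumed greater than start
    text: str     # natural-language description of the sub-task


def interval_iou(a: SubTask, b: SubTask) -> float:
    """Intersection over union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def token_jaccard(a: str, b: str) -> float:
    """Word-set overlap; an assumed proxy for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def similarity_sketch(pred: list[SubTask], truth: list[SubTask]) -> tuple[float, float]:
    """Return (temporal, semantic) scores averaged over ground-truth sub-tasks.

    Each ground-truth sub-task is matched to the predicted sub-task with the
    highest temporal IoU; both scores are computed on that matched pair.
    """
    if not pred or not truth:
        return 0.0, 0.0
    temporal, semantic = 0.0, 0.0
    for gt in truth:
        best = max(pred, key=lambda p: interval_iou(p, gt))
        temporal += interval_iou(best, gt)
        semantic += token_jaccard(best.text, gt.text)
    return temporal / len(truth), semantic / len(truth)


# Hypothetical example data, not taken from the paper.
truth = [SubTask(0, 4, "reach for the cube"), SubTask(4, 10, "lift the cube")]
pred = [SubTask(0, 5, "reach for the cube"), SubTask(5, 10, "lift the cube")]
temporal_score, semantic_score = similarity_sketch(pred, truth)
```

In this toy case the predicted boundary is off by one time unit, so the temporal score drops below 1 while the semantic score stays at 1 because the descriptions match exactly. A decomposition compared against itself scores 1 on both.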