Deformable object manipulation remains a key challenge in developing autonomous robotic systems that can be successfully deployed in real-world scenarios. In this work, we explore the challenges of deformable object manipulation through the task of sculpting clay into 3D shapes. We propose the first coarse-to-fine autonomous sculpting system, in which the sculpting agent first selects how many discrete chunks of clay to place in the workspace and where to place them to create a coarse shape, and then iteratively refines the shape with sequences of deformation actions. We leverage large language models for sub-goal generation and train a point cloud region-based action model to predict robot actions from the desired point cloud sub-goals. Additionally, our method is the first autonomous sculpting system to operate as a real-world text-to-3D shaping pipeline, with no explicit 3D goals or sub-goals provided to the system. We demonstrate that our method can successfully create a set of simple shapes solely from text-based prompting. Furthermore, we rigorously explore how best to quantify success for the text-to-3D sculpting task, and compare existing text-image and text-point cloud similarity metrics to human evaluations for this task. For experimental videos, human evaluation details, and full prompts, please see our project website: https://sites.google.com/andrew.cmu.edu/hierarchicalsculpting
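The coarse-to-fine control flow described in the abstract can be pictured with a minimal sketch. Everything named here is an illustrative assumption rather than the system's actual interface: `SculptingPlan`, `run_coarse_to_fine`, and the injected callables (`plan_from_text`, `place_chunk`, `observe`, `predict_actions`, `execute`) are hypothetical stand-ins for the language-model planner, the robot primitives, and the point cloud region-based action model. The sketch also assumes that the whole plan is produced up front, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
PointCloud = List[Point]


@dataclass
class SculptingPlan:
    """Hypothetical container for the output of the text-driven planner."""
    chunk_positions: List[Point]  # coarse stage: one position per clay chunk
    subgoals: List[PointCloud]    # fine stage: target point clouds, in order


def run_coarse_to_fine(
    prompt: str,
    plan_from_text: Callable[[str], SculptingPlan],
    place_chunk: Callable[[Point], None],
    observe: Callable[[], PointCloud],
    predict_actions: Callable[[PointCloud, PointCloud], Sequence[object]],
    execute: Callable[[object], None],
) -> PointCloud:
    """Run one coarse placement pass, then refine toward each sub-goal."""
    # Assumption: the planner returns chunk placements and sub-goals together.
    plan = plan_from_text(prompt)

    # Coarse stage: place discrete chunks of clay to form a rough shape.
    for position in plan.chunk_positions:
        place_chunk(position)

    # Fine stage: for each sub-goal, observe the clay and apply the
    # deformation actions predicted from (current cloud, sub-goal cloud).
    for subgoal in plan.subgoals:
        current = observe()
        for action in predict_actions(current, subgoal):
            execute(action)

    # Return the final observed point cloud of the sculpture.
    return observe()
```

Passing the planner, perception, and action model in as callables keeps the sketch independent of any particular robot or model, so each stage can be replaced with a stub when testing the control flow.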