Text to Robotic Assembly of Multi Component Objects using 3D Generative AI and Vision Language Models

Advances in 3D generative AI have enabled the creation of physical objects from text prompts, but challenges remain in creating objects involving multiple component types. We present a pipeline that integrates 3D generative AI with vision-language models (VLMs) to enable the robotic assembly of multi-component objects from natural language. Our method leverages VLMs for zero-shot, multi-modal reasoning about geometry and functionality to decompose AI-generated meshes into multi-component 3D models using predefined structural and panel components. We demonstrate that a VLM is capable of determining which mesh regions need panel components in addition to structural components, based on the object's geometry and functionality. Evaluation across test objects shows that users preferred the VLM-generated assignments 90.6% of the time, compared to 59.4% for rule-based and 2.5% for random assignment. Lastly, the system allows users to refine component assignments through conversational feedback, enabling greater human control and agency in making physical objects with generative AI and robotics.
