Image-guided object assembly is an emerging research topic in computer vision. This paper introduces a novel task: translating multi-view images of a structural 3D model (for example, one constructed with building blocks drawn from a 3D-object library) into a detailed sequence of assembly instructions executable by a robotic arm. Given multi-view images of the target 3D model to be replicated, a model designed for this task must address several sub-tasks: recognizing the individual components used to construct the 3D model, estimating the geometric pose of each component, and deducing a feasible assembly order that adheres to physical rules. Establishing accurate 2D-3D correspondence between the multi-view images and the 3D objects is technically challenging. To tackle this, we propose an end-to-end model, the Neural Assembler. The model learns an object graph in which each vertex represents a component recognized from the images and the edges specify the topology of the 3D model, enabling the derivation of an assembly plan. We establish benchmarks for this task and conduct comprehensive empirical evaluations of Neural Assembler and alternative solutions. In these experiments, Neural Assembler outperforms the alternative solutions.
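To make the object-graph idea concrete, the following is a minimal sketch of how an assembly order could be read off such a graph. It is an illustration under stated assumptions and not the Neural Assembler implementation: the abstract does not say how the plan is derived, so the sketch assumes that each edge can be read as a "supports" relation (the lower component must be placed before the upper one) and that a topological ordering of the graph is an acceptable plan. The names `Component`, `assembly_order`, and the example part identifiers are hypothetical.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Component:
    """A recognized component (graph vertex). All fields are illustrative."""
    shape: str                             # hypothetical identifier in the 3D-object library
    position: tuple[float, float, float]   # assumed estimated (x, y, z)
    yaw_deg: float                         # assumed estimated rotation about the vertical axis


def assembly_order(components, supports):
    """Return component ids in an order where every supporter precedes what it supports.

    components: dict mapping id -> Component
    supports:   iterable of (lower_id, upper_id) edges, an assumed reading of the graph topology
    Raises graphlib.CycleError if the edges contain a cycle (no feasible order).
    """
    sorter = TopologicalSorter()
    for cid in components:
        sorter.add(cid)            # register every vertex, including isolated ones
    for lower, upper in supports:
        sorter.add(upper, lower)   # upper depends on lower being placed first
    return list(sorter.static_order())


# Usage: a three-block tower, base -> mid -> top.
parts = {
    "base": Component("brick_2x4", (0.0, 0.0, 0.0), 0.0),
    "mid": Component("brick_2x2", (0.0, 0.0, 1.0), 90.0),
    "top": Component("brick_1x2", (0.0, 0.0, 2.0), 0.0),
}
edges = [("base", "mid"), ("mid", "top")]
plan = assembly_order(parts, edges)
print(plan)  # ['base', 'mid', 'top']
```

A topological sort is only one plausible way to turn graph topology into a plan; it guarantees ordering constraints but does not by itself check physical feasibility such as collisions or reachability of the robotic arm.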