We present Real2Code, a novel approach to reconstructing articulated objects via code generation. Given visual observations of an object, we first reconstruct its part geometry using an image segmentation model and a shape completion model. We then represent the object parts with oriented bounding boxes, which are input to a fine-tuned large language model (LLM) that predicts joint articulation as code. By leveraging pre-trained vision and language models, our approach scales elegantly with the number of articulated parts and generalizes from synthetic training data to real-world objects in unstructured environments. Experimental results demonstrate that Real2Code significantly outperforms the previous state of the art in reconstruction accuracy, and it is the first approach to extrapolate beyond the structural complexity of objects in the training set, reconstructing objects with up to 10 articulated parts. When combined with a stereo reconstruction model, Real2Code also generalizes to real-world objects from a handful of multi-view RGB images, without requiring depth or camera information.
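To make the pipeline concrete, the sketch below illustrates (in Python) what "representing parts as oriented bounding boxes and predicting articulation as code" could look like. All names here (`ObbPart`, `parts_to_prompt`, `make_joint`) are illustrative assumptions for exposition, not the paper's actual interface.

```python
# Hypothetical sketch: serialize part OBBs into an LLM prompt, and show the
# style of articulation code the model might emit. Names are assumptions,
# not Real2Code's actual API.
from dataclasses import dataclass
from typing import List

@dataclass
class ObbPart:
    """One articulated part, summarized by an oriented bounding box."""
    center: List[float]               # (x, y, z) box center
    extents: List[float]              # half-lengths along the box axes
    rotation: List[List[float]]       # 3x3 rotation of the box frame

def parts_to_prompt(parts: List[ObbPart]) -> str:
    """Serialize the OBBs into a compact text prompt for the fine-tuned LLM."""
    lines = [
        f"part_{i}: center={p.center}, extents={p.extents}"
        for i, p in enumerate(parts)
    ]
    return "\n".join(lines)

# The LLM is fine-tuned to output articulation as executable code; a revolute
# joint (hinge) between a cabinet body and its door might read like this
# (make_joint is a hypothetical helper):
predicted_code = """
make_joint(parent='part_0', child='part_1', type='revolute',
           axis=[0.0, 0.0, 1.0], origin=[0.35, 0.0, 0.0])
"""

if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    body = ObbPart([0.0, 0.0, 0.0], [0.4, 0.3, 0.5], identity)
    door = ObbPart([0.35, 0.0, 0.0], [0.02, 0.3, 0.5], identity)
    print(parts_to_prompt([body, door]))
    print(predicted_code)
```

Because each part is reduced to a few numbers and each joint to one line of code, the prompt and output lengths grow only linearly with the number of parts, which is what lets the approach scale to objects with many articulated parts.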