Reverse engineering 3D computer-aided design (CAD) models from images is an important task for many downstream applications, including interactive editing, manufacturing, architecture, and robotics. The difficulty of the task lies in the vast representational disparity between the CAD output and the image input. CAD models are precise, programmatic constructs that involve sequential operations combining a discrete command structure with continuous attributes, which makes them challenging to learn and optimize in an end-to-end fashion. At the same time, input images introduce inherent challenges such as photometric variability and sensor noise, which complicate the reverse engineering process. In this work, we introduce a novel approach that conditionally factorizes the task into two sub-problems. First, we leverage large foundation models, particularly GPT-4V, to predict the global discrete base structure with semantic information. Second, we propose TrAssembler, which, conditioned on the discrete structure with semantics, predicts the continuous attribute values. To support the training of TrAssembler, we further construct an annotated CAD dataset of common objects from ShapeNet. Taken together, our approach and data demonstrate significant first steps towards CAD-ifying images in the wild. Our project page: https://anonymous123342.github.io/
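The conditional factorization into two sub-problems can be sketched probabilistically. The notation is assumed here for illustration and is not taken from the paper: let $I$ be the input image, $S$ the global discrete base structure with semantic information, and $A$ the continuous attribute values.

```latex
\[
  p(S, A \mid I) \;=\;
  \underbrace{p(S \mid I)}_{\text{discrete structure (GPT-4V stage)}}
  \;\cdot\;
  \underbrace{p(A \mid S, I)}_{\text{continuous attributes (TrAssembler stage)}}
\]
```

This is simply the chain rule applied to the joint prediction, so it holds in general. The second factor is written with $I$ for generality; the abstract states only that TrAssembler is conditioned on the discrete structure with semantics.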