We introduce ART, the Articulated Reconstruction Transformer -- a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across multiple benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.