Articulated objects are fundamental to robotics, physics simulation, and interactive virtual environments. However, reconstructing them from visual input remains challenging, as it requires jointly inferring both part geometry and kinematic structure. We present an end-to-end autoregressive framework that directly generates executable articulated object models from visual observations. Given an image and object-level 3D cues, our method sequentially produces part geometries and their associated joint parameters, yielding complete URDF models without relying on multi-stage pipelines. Generation proceeds until the model determines that all parts have been produced, automatically inferring the complete geometry and kinematics. Building on this capability, we enable a new Real-Follow-Sim paradigm, in which high-fidelity digital twins constructed from visual observations allow policies trained and tested purely in simulation to transfer to real robots without online adaptation. Experiments on large-scale articulated object benchmarks and real-world robotic tasks demonstrate that our method outperforms prior approaches in geometric reconstruction quality, joint parameter accuracy, and physical executability.
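To make the described output concrete, the sketch below illustrates the kind of part-by-part autoregressive loop and URDF assembly the abstract refers to. It is a minimal illustration under assumed interfaces: the `next_part` method, the stub model, and all names are hypothetical and not the paper's actual API.

```python
# Hypothetical sketch: decode parts one at a time until the model signals
# completion, then assemble the parts and joints into an executable URDF string.
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Part:
    name: str
    mesh: str                       # identifier/path of the decoded part geometry
    parent: Optional[str]           # parent link name; None for the root link
    joint_type: str = "fixed"       # e.g. "revolute", "prismatic", "fixed"
    axis: Sequence[float] = (0.0, 0.0, 1.0)
    origin: Sequence[float] = (0.0, 0.0, 0.0)


class StubModel:
    """Toy stand-in that 'decodes' a two-part cabinet, then stops (hypothetical)."""

    def __init__(self):
        self._script = [
            Part("base", "base.obj", parent=None),
            Part("door", "door.obj", parent="base",
                 joint_type="revolute", axis=(0, 0, 1), origin=(0.3, 0.0, 0.0)),
        ]

    def next_part(self, image, points, parts_so_far) -> Optional[Part]:
        # Return the next part, or None once every part has been produced.
        if len(parts_so_far) < len(self._script):
            return self._script[len(parts_so_far)]
        return None


def generate(model, image, points, max_parts=16):
    """Autoregressive loop: request parts until the model emits a stop signal."""
    parts = []
    for _ in range(max_parts):
        nxt = model.next_part(image, points, parts)
        if nxt is None:           # model determines all parts have been produced
            break
        parts.append(nxt)
    return parts


def to_urdf(name, parts):
    """Assemble decoded parts and their joint parameters into a URDF string."""
    links = [f'  <link name="{p.name}"><visual><geometry>'
             f'<mesh filename="{p.mesh}"/></geometry></visual></link>'
             for p in parts]
    joints = [
        f'  <joint name="{p.parent}_to_{p.name}" type="{p.joint_type}">\n'
        f'    <parent link="{p.parent}"/><child link="{p.name}"/>\n'
        f'    <origin xyz="{p.origin[0]} {p.origin[1]} {p.origin[2]}"/>\n'
        f'    <axis xyz="{p.axis[0]} {p.axis[1]} {p.axis[2]}"/>\n'
        f'  </joint>'
        for p in parts if p.parent is not None
    ]
    return f'<robot name="{name}">\n' + "\n".join(links + joints) + "\n</robot>"


if __name__ == "__main__":
    parts = generate(StubModel(), image=None, points=None)
    print(to_urdf("cabinet", parts))
```

The resulting URDF can be loaded directly by standard simulators, which is what makes the generated models "executable" in the sense used above.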