Understanding articulated objects from monocular video is a crucial yet challenging task in robotics and digital twin creation. Existing methods often rely on complex multi-view setups, high-fidelity object scans, or fragile long-term point tracks that frequently fail in casual real-world captures. In this paper, we present sim2art, a data-driven framework that recovers the 3D part segmentation and joint parameters of articulated objects from a single monocular video captured by a freely moving camera. Our core insight is a robust representation based on per-frame surface point sampling, which we augment with short-term scene flow and DINOv3 semantic features. Unlike prior work that depends on error-prone long-term correspondences, our representation is easy to obtain and exhibits a negligible gap between simulation and reality without requiring domain adaptation. Moreover, our method by construction relies only on single-viewpoint visibility, ensuring that the geometric representation remains consistent across synthetic and real data despite noise and occlusions. Leveraging a Transformer-based architecture, sim2art is trained exclusively on synthetic data yet generalizes strongly to real-world sequences. To address the lack of standardized benchmarks in the field, we introduce two datasets featuring a significantly higher diversity of object categories and instances than prior work. Our evaluations show that sim2art effectively handles large camera motions and complex articulations, outperforming state-of-the-art optimization-based and tracking-dependent methods. sim2art offers a scalable solution that can be easily extended to new object categories without the need for cumbersome real-world annotations. Project webpage: https://aartykov.github.io/sim2art/
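To make the input representation concrete, the following is a minimal sketch (not the authors' code) of how one might assemble the per-frame point representation the abstract describes: surface points back-projected from depth, augmented with short-term scene flow and per-pixel semantic features. The function name `sample_frame_tokens` and the inputs `flow_3d` and `feat_map` (e.g., DINOv3 features) are illustrative placeholders assumed to be produced by upstream models.

```python
# A minimal sketch, assuming depth, intrinsics, a short-term 3D scene-flow map,
# and a per-pixel semantic feature map (e.g., from DINOv3) are already available.
# All names here are hypothetical and not taken from the sim2art implementation.
import numpy as np

def sample_frame_tokens(depth, K, flow_3d, feat_map, num_points=2048, rng=None):
    """Build per-point input tokens [xyz | scene flow | semantic feature].

    depth    : (H, W) metric depth for the current frame
    K        : (3, 3) camera intrinsics
    flow_3d  : (H, W, 3) short-term 3D scene flow to the next frame (assumed input)
    feat_map : (H, W, C) per-pixel semantic features (assumed input)
    """
    rng = rng or np.random.default_rng(0)
    valid = depth > 0                                   # keep pixels with valid depth
    vs, us = np.nonzero(valid)
    idx = rng.choice(len(vs), size=min(num_points, len(vs)), replace=False)
    vs, us = vs[idx], us[idx]

    # Back-project sampled pixels to camera-space 3D surface points.
    z = depth[vs, us]
    x = (us - K[0, 2]) * z / K[0, 0]
    y = (vs - K[1, 2]) * z / K[1, 1]
    xyz = np.stack([x, y, z], axis=-1)                  # (N, 3)

    # Concatenate geometry, short-term motion, and semantics into one token per point.
    tokens = np.concatenate([xyz, flow_3d[vs, us], feat_map[vs, us]], axis=-1)
    return tokens                                       # (N, 3 + 3 + C)
```

Such per-frame tokens could then be fed to a Transformer that predicts part segmentation and joint parameters; because every quantity is computed from a single viewpoint over a short temporal window, the same construction applies unchanged to synthetic and real captures.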