We present MotionCrafter, a framework that leverages video generators to jointly reconstruct 4D geometry and estimate dense motion from a monocular video. The key idea is a joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, together with a 4D VAE tailored to learn this representation effectively. Prior work strictly aligns 3D values and latents with RGB VAE latents, despite their fundamentally different distributions; we show that such alignment is unnecessary and can hurt performance. Instead, we propose a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments on multiple datasets show that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, delivering 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, all without any post-optimization. Project page: https://ruijiezhu94.github.io/MotionCrafter_Page