We introduce MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. The core of our method is a novel joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, together with a novel 4D VAE that effectively learns this representation. Unlike prior work that forces 3D values and latents to align strictly with RGB VAE latents (despite their fundamentally different distributions), we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, delivering improvements of 38.64% in geometry reconstruction and 25.0% in motion reconstruction, all without any post-optimization. Project page: https://ruijiezhu94.github.io/MotionCrafter_Page