Current feed-forward 3D/4D reconstruction systems rely on dense geometry and pose supervision, which is expensive to obtain at scale and particularly scarce for dynamic real-world scenes. We present Flow3r, a framework that augments visual geometry learning with dense 2D correspondences (`flow') as supervision, enabling scalable training from unlabeled monocular videos. Our key insight is that the flow prediction module should be factored: it predicts flow between two images using geometry latents from one and pose latents from the other. This factorization directly guides the learning of both scene geometry and camera motion, and naturally extends to dynamic scenes. In controlled experiments, we show that factored flow prediction outperforms alternative designs and that performance scales consistently with unlabeled data. Integrating factored flow into existing visual geometry architectures and training with ${\sim}800$K unlabeled videos, Flow3r achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes, with its largest gains on in-the-wild dynamic videos where labeled data is most scarce.
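To make the factorization concrete, the following is a minimal sketch of the classical rigid-flow computation for a static scene, where the flow from image $i$ to image $j$ is fully determined by the depth of image $i$ and the pose of camera $j$ relative to camera $i$. It is only an explicit geometric analogue of the idea: Flow3r predicts flow from learned geometry and pose latents, and nothing here reproduces that module. The function name `rigid_flow`, the pinhole intrinsics `K`, and the arguments `R_ji` and `t_ji` are assumptions introduced for illustration.

```python
import numpy as np

def rigid_flow(depth_i, K, R_ji, t_ji):
    """Flow from view i to view j induced by a static scene.

    Hypothetical helper, not the paper's learned flow head.
    depth_i: (H, W) depth map of view i (geometry comes from view i only).
    K:       (3, 3) pinhole intrinsics, assumed shared by both views.
    R_ji:    (3, 3) rotation taking camera-i coordinates to camera-j.
    t_ji:    (3,)   translation taking camera-i coordinates to camera-j.
    Returns an (H, W, 2) array of pixel displacements (du, dv).
    """
    h, w = depth_i.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels

    # Unproject pixels of view i to 3D points using its depth.
    rays = pix @ np.linalg.inv(K).T
    pts_i = rays * depth_i[..., None]

    # Move the points into camera j's frame using the relative pose.
    pts_j = pts_i @ R_ji.T + t_ji

    # Project into view j and subtract the original pixel locations.
    proj = pts_j @ K.T
    uv_j = proj[..., :2] / proj[..., 2:3]
    return uv_j - pix[..., :2]
```

With an identity pose the flow is zero everywhere. For a fronto-parallel plane at depth $d$ and a translation $t_x$ along the camera's x-axis, the horizontal flow is the constant $f_x t_x / d$. In this static case, geometry enters only through view $i$ and camera motion only through view $j$, which mirrors the split between geometry latents and pose latents described in the abstract.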