Reconstructing and tracking dynamic 3D scenes remains a fundamental challenge in computer vision. Existing approaches often decouple geometry from motion: multi-view reconstruction methods assume static scenes, while dynamic tracking frameworks rely on explicit camera pose estimation or separate motion models. We propose Flow4R, a unified framework that treats camera-space scene flow as the central representation linking 3D structure, object motion, and camera motion. Flow4R predicts a minimal per-pixel property set (3D point position, scene flow, pose weight, and confidence) from two-view inputs using a Vision Transformer. This flow-centric formulation allows local geometry and bidirectional motion to be inferred symmetrically with a shared decoder in a single forward pass, without requiring explicit pose regressors or bundle adjustment. Trained jointly on static and dynamic datasets, Flow4R achieves state-of-the-art performance on 4D reconstruction and tracking tasks, demonstrating the effectiveness of the flow-centric representation for spatiotemporal scene understanding.
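To make the per-pixel output structure concrete, the following is a minimal sketch (not the paper's implementation) of how a shared decoder head might split its channels into the four stated properties. All names (`decode_outputs`, `pts3d`, `flow3d`, `pose_w`, `conf`) and the flat 8-channel layout are illustrative assumptions; the abstract only specifies that the four properties are predicted per pixel.

```python
import numpy as np

def decode_outputs(features, H, W):
    """Hypothetical split of a flat (H*W, 8) decoder output into the four
    per-pixel properties named in the abstract: 3D point position (3 ch),
    camera-space scene flow (3 ch), pose weight (1 ch), confidence (1 ch)."""
    f = np.asarray(features).reshape(H, W, 8)
    return {
        "pts3d": f[..., 0:3],   # camera-space 3D point position
        "flow3d": f[..., 3:6],  # camera-space scene flow
        "pose_w": f[..., 6:7],  # pose weight
        "conf": f[..., 7:8],    # confidence
    }

# Example: a 4x6 feature map decoded into the four property maps.
H, W = 4, 6
out = decode_outputs(np.zeros((H * W, 8)), H, W)
print({k: v.shape for k, v in out.items()})
```

In such a formulation, the same head would be applied symmetrically to both views, so bidirectional motion comes from running the shared decoder once per direction within a single forward pass.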