Dense 4D reconstruction from unposed images remains a critical challenge: current methods rely on slow test-time optimization or on fragmented, task-specific feedforward models. We introduce UFO-4D, a unified feedforward framework that reconstructs a dense, explicit 4D representation from just a pair of unposed images. UFO-4D directly estimates dynamic 3D Gaussian splats, enabling joint, consistent estimation of 3D geometry, 3D motion, and camera pose in a single feedforward pass. Our core insight is that differentiably rendering multiple signals from a single dynamic 3D Gaussian representation offers major training advantages: it enables a self-supervised image-synthesis loss while tightly coupling appearance, depth, and motion. Because all modalities share the same geometric primitives, supervising one inherently regularizes and improves the others. This synergy overcomes data scarcity, allowing UFO-4D to outperform prior work by up to 3× in joint geometry, motion, and camera-pose estimation. Our representation also enables high-fidelity 4D interpolation across novel views and time. Please visit our project page for visual results: https://ufo-4d.github.io/