GFlow: Recovering 4D World from Monocular Video

Reconstructing 4D scenes from video inputs is a crucial yet challenging task. Conventional methods usually rely on multi-view video inputs, known camera parameters, or static scenes, all of which are typically unavailable in in-the-wild scenarios. In this paper, we relax all these constraints and tackle a highly ambitious but practical task, which we term AnyV4D: we assume that only one monocular video is available as input, without any camera parameters, and we aim to recover the dynamic 4D world alongside the camera poses. To this end, we introduce GFlow, a new framework that uses only 2D priors (depth and optical flow) to lift a video (3D) to an explicit 4D representation, entailing a flow of Gaussian splatting through space and time. GFlow first clusters the scene into still and moving parts, then applies a sequential optimization process that optimizes the camera poses and the dynamics of the 3D Gaussian points based on the 2D priors and the scene clustering, ensuring fidelity among neighboring points and smooth movement across frames. Since dynamic scenes always introduce new content, we also propose a new pixel-wise densification strategy for Gaussian points to integrate new visual content. Moreover, GFlow goes beyond 4D reconstruction: it also enables tracking of any point across frames without prior training, and it segments moving objects from the scene in an unsupervised way. Additionally, the camera pose of each frame can be derived from GFlow, allowing novel views of a video scene to be rendered by changing the camera pose. Because the representation is explicit, scene-level or object-level editing can be readily performed as desired, underscoring its versatility and power. Visit our project website at: https://littlepure2333.github.io/GFlow
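
To make the idea of lifting 2D priors into 3D more concrete, the following is a minimal sketch of one ingredient such a pipeline could use: back-projecting a per-pixel depth map into 3D points that might serve as initial centers for Gaussian points. It is not the GFlow implementation. The pinhole camera model, the function name `unproject_depth`, and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions introduced here for illustration only; the abstract does not say how GFlow initializes points or what camera model it uses.

```python
import numpy as np


def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map to 3D points in camera coordinates.

    Assumes a pinhole camera (an illustrative choice, not taken from the paper).
    depth: (H, W) array of per-pixel depth along the camera z-axis.
    Returns an (H * W, 3) array of points in row-major pixel order.
    """
    h, w = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)


# Toy example: a 2x3 depth map in which every pixel is 2 units away.
depth = np.full((2, 3), 2.0)
points = unproject_depth(depth, fx=1.0, fy=1.0, cx=1.0, cy=0.5)
```

Each row of `points` is one candidate 3D location per pixel. In a setting like the one described above, the camera pose would still have to be estimated before such points could be placed in a shared world frame.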
