We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. Unlike existing approaches that typically decouple motion from geometry or produce limited 4D attributes such as sparse trajectories or two-view scene flow, 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics. At its core, 4RC introduces a novel encode-once, query-anywhere and anytime paradigm: a transformer backbone encodes the entire video into a compact spatio-temporal latent space, from which a conditional decoder can efficiently query 3D geometry and motion for any query frame at any target timestamp. To facilitate learning, we represent per-view 4D attributes in a minimally factorized form by decomposing them into base geometry and time-dependent relative motion. Extensive experiments demonstrate that 4RC outperforms prior and concurrent methods across a wide range of 4D reconstruction tasks.
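To make the encode-once, query-anywhere and anytime paradigm and the factorized representation concrete, the following is a minimal toy sketch, not the 4RC implementation. It makes several assumptions that the abstract does not state: the function names `encode_video` and `query` are hypothetical, the "latent" is just the stored input array rather than a transformer encoding, relative motion is modelled as a constant-velocity displacement, and the 4D points are recovered by adding the relative motion to the base geometry.

```python
import numpy as np

def encode_video(video):
    """Hypothetical stand-in for the encoder: called once per video.

    video: (T, H, W) array of toy per-pixel depths. A real model would
    produce a compact spatio-temporal latent; here we only store the input.
    """
    return {"latent": np.asarray(video, dtype=np.float64)}

def query(latent, frame_idx, target_t, velocity):
    """Hypothetical stand-in for the conditional decoder.

    Returns (base_geometry, relative_motion), each of shape (H, W, 3),
    for the pixels of frame `frame_idx` queried at timestamp `target_t`.
    """
    depth = latent["latent"][frame_idx]                      # (H, W)
    h, w = depth.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Base geometry: toy 3D points of the query frame at its own timestamp.
    base = np.stack([xs * depth, ys * depth, depth], axis=-1)
    # Relative motion: assumed constant-velocity displacement to target_t.
    motion = np.broadcast_to((target_t - frame_idx) * velocity, base.shape)
    return base, motion

# Encode once, then query the same latent for different frames and timestamps.
video = np.ones((3, 2, 2))
latent = encode_video(video)
velocity = np.array([1.0, 0.0, 0.0])

base, motion = query(latent, frame_idx=0, target_t=2, velocity=velocity)
points_at_t = base + motion          # assumed additive composition

base_same, motion_same = query(latent, frame_idx=1, target_t=1, velocity=velocity)
```

In this sketch, querying a frame at its own timestamp yields zero relative motion, so the result reduces to the base geometry. That is the property the factorization is meant to illustrate: geometry is predicted once per query frame, and only the motion term depends on the target timestamp.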