Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step. However, such setups are only feasible in constrained environments. In real-world scenarios, observations are often sparse in time and captured from a small set of diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object's motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, our model enables smooth motion interpolation while preserving learned geometric details. Experiments on synthetic datasets show that our method outperforms existing approaches under sparse observations by up to 34% in PSNR, and achieves performance comparable to dense monocular video methods on real-world datasets despite using significantly fewer frames. Moreover, we demonstrate that the initial static reconstruction can be replaced by a diffusion-based generative prior, making our method more practical for real-world scenarios.