In this paper, we present a method to reconstruct the world and multiple dynamic humans in 3D from a monocular video input. As a key idea, we represent both the world and the humans with the recently emerging 3D Gaussian Splatting (3D-GS) representation, which lets us compose and render them together conveniently and efficiently. In particular, we address scenarios with severely limited and sparse observations in 3D human reconstruction, a common challenge in the real world. To tackle this challenge, we introduce a novel approach that optimizes the 3D-GS representation in a canonical space by fusing the sparse cues in this common space, where we leverage a pre-trained 2D diffusion model to synthesize unseen views while keeping them consistent with the observed 2D appearances. We demonstrate that our method can reconstruct high-quality animatable 3D humans in various challenging examples, in the presence of occlusion, image crops, few-shot inputs, and extremely sparse observations. After reconstruction, our method can not only render the scene from any novel view at arbitrary time instances, but also edit the 3D scene by removing individual humans or applying different motions to each human. Through various experiments, we demonstrate the quality and efficiency of our method compared with existing alternative approaches.
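As an illustration of why a shared 3D-GS representation makes composition and editing convenient, the following is a minimal sketch, not the implementation described in this paper. It assumes each entity (the world or one human) is stored as a set of per-Gaussian arrays already expressed in a common world frame; the names `GaussianSet`, `compose`, and `random_set` are hypothetical, and the sketch omits the canonical-space optimization, human articulation, and rendering.

```python
from dataclasses import dataclass, fields

import numpy as np


@dataclass
class GaussianSet:
    """Hypothetical container for one entity's Gaussians (world or a human)."""
    means: np.ndarray      # (N, 3) Gaussian centers in the world frame
    scales: np.ndarray     # (N, 3) per-axis scales
    rotations: np.ndarray  # (N, 4) unit quaternions (w, x, y, z)
    opacities: np.ndarray  # (N,) opacity per Gaussian
    colors: np.ndarray     # (N, 3) RGB color per Gaussian


def compose(sets):
    """Merge several Gaussian sets into one by concatenating every attribute."""
    merged = {
        f.name: np.concatenate([getattr(s, f.name) for s in sets], axis=0)
        for f in fields(GaussianSet)
    }
    return GaussianSet(**merged)


def random_set(n, seed):
    """Build a placeholder set of n Gaussians (random values, for illustration)."""
    rng = np.random.default_rng(seed)
    return GaussianSet(
        means=rng.normal(size=(n, 3)),
        scales=rng.uniform(0.01, 0.1, size=(n, 3)),
        rotations=np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)),  # identity rotations
        opacities=rng.uniform(size=n),
        colors=rng.uniform(size=(n, 3)),
    )


world = random_set(100, seed=0)
humans = [random_set(10, seed=1), random_set(20, seed=2)]

# Full scene: the world plus every human, ready for a single splatting pass.
full_scene = compose([world] + humans)

# Scene editing: drop the first human by leaving it out of the composition.
edited_scene = compose([world] + humans[1:])
```

Under these assumptions, removing a person amounts to leaving that person's Gaussians out of the concatenation, and applying a different motion would amount to re-posing one human's Gaussians before composing.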