JOintGS: Joint Optimization of Cameras, Bodies and 3D Gaussians for In-the-Wild Monocular Reconstruction

Reconstructing high-fidelity animatable 3D human avatars from monocular RGB videos remains challenging, particularly in unconstrained in-the-wild scenarios where camera parameters and human poses from off-the-shelf methods (e.g., COLMAP, HMR2.0) are often inaccurate. Although recent advances in 3D Gaussian Splatting (3DGS) demonstrate impressive rendering quality and real-time performance, they critically depend on precise camera calibration and pose annotations, which limits their applicability in real-world settings. We present JOintGS, a unified framework that jointly optimizes camera extrinsics, human poses, and 3D Gaussian representations from coarse initialization through a synergistic refinement mechanism. Our key insight is that explicit foreground-background disentanglement enables mutual reinforcement: static background Gaussians anchor camera estimation via multi-view consistency; refined cameras improve human body alignment through accurate temporal correspondence; and optimized human poses enhance scene reconstruction by removing dynamic artifacts from static constraints. We further introduce a temporal dynamics module to capture fine-grained pose-dependent deformations and a residual color field to model illumination variations. Extensive experiments on the NeuMan and EMDB datasets demonstrate that JOintGS achieves superior reconstruction quality, with a 2.1 dB PSNR improvement over state-of-the-art methods on the NeuMan dataset, while maintaining real-time rendering. Notably, our method shows significantly enhanced robustness to noisy initialization compared to the baseline. Our source code is available at https://github.com/MiliLab/JOintGS.
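
To give a sense of scale for the reported 2.1 dB PSNR gain, the following minimal sketch applies the standard PSNR definition, PSNR = 10 log10(MAX^2 / MSE). It is an illustration only and not taken from the JOintGS code: the function names are hypothetical, images are assumed to be normalized to [0, 1], and the exact evaluation protocol (masking, per-frame averaging) is not stated in the abstract. Under those assumptions, a 2.1 dB gain corresponds to roughly a 38% reduction in mean squared error.

```python
import math


def psnr(mse: float, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error.

    max_val is the largest possible pixel value (1.0 is an assumption
    for images normalized to [0, 1]).
    """
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE is multiplied when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


# A 2.1 dB improvement scales the MSE by about 0.62.
ratio = mse_ratio_for_gain(2.1)

# Sanity check with a hypothetical baseline error of 1e-3 (30 dB).
baseline_mse = 1e-3
improved_psnr = psnr(baseline_mse * ratio)
```

Here `ratio` evaluates to about 0.617, so a hypothetical 30 dB baseline would move to about 32.1 dB. If PSNR is averaged per frame, the same factor applies to the geometric mean of the per-frame MSE values rather than to their arithmetic mean.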
