We present Vid2Avatar, a method for learning human avatars from monocular in-the-wild videos. Reconstructing naturally moving humans from such videos is difficult: it requires accurately separating the human from an arbitrary background, and it requires reconstructing a detailed 3D surface from a short video sequence, which makes the problem even more challenging. Despite these challenges, our method requires neither ground-truth supervision nor priors extracted from large datasets of clothed human scans, and it does not rely on any external segmentation modules. Instead, it solves the tasks of scene decomposition and surface reconstruction directly in 3D by jointly modeling the human and the background in the scene, each parameterized by a separate neural field. Specifically, we define a temporally consistent human representation in canonical space and formulate a global optimization over the background model, the canonical human shape and texture, and the per-frame human pose parameters. We introduce a coarse-to-fine sampling strategy for volume rendering and novel objectives that cleanly separate the dynamic human from the static background, yielding detailed and robust 3D human geometry reconstructions. We evaluate our method on publicly available datasets and show improvements over prior art.
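To make the idea of rendering a human and a background from two separate fields more concrete, the following is a minimal sketch of standard volume-rendering compositing along a single ray. It is not the paper's exact formulation: the function name `composite_ray`, its arguments, and the simplification of the background field to a single per-ray color are all assumptions made for illustration. The accumulated human weight along the ray acts as a soft foreground mask, which suggests how a decomposition can emerge without an external segmentation module.

```python
import numpy as np


def composite_ray(human_sigma, human_rgb, deltas, background_rgb):
    """Composite human samples along one ray over a background color.

    Hypothetical helper for illustration only.
    human_sigma:    (N,) densities of the human field at N ray samples
    human_rgb:      (N, 3) colors of the human field at those samples
    deltas:         (N,) distances between consecutive samples
    background_rgb: (3,) color predicted by the background field for this ray
    """
    # Per-sample opacity from density and segment length.
    alpha = 1.0 - np.exp(-human_sigma * deltas)
    # Transmittance: fraction of light reaching each sample unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    # Color contributed by the human field.
    human_color = (weights[:, None] * human_rgb).sum(axis=0)
    # Accumulated opacity serves as a soft human mask for this pixel.
    human_mask = weights.sum()
    # Whatever light is not absorbed by the human comes from the background.
    color = human_color + (1.0 - human_mask) * background_rgb
    return color, human_mask
```

With zero human density everywhere, the ray returns the background color and a mask of zero; with a very dense first sample, it returns that sample's color and a mask close to one. In a joint optimization, comparing `color` against the observed pixel would propagate gradients to both fields.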