You Only Train Once: Multi-Identity Free-Viewpoint Neural Human Rendering from Monocular Videos

We introduce You Only Train Once (YOTO), a dynamic human generation framework that performs free-viewpoint rendering of different human identities with distinct motions after a single training run on monocular videos. Most prior works on this task require a separate optimization for each input video containing a distinct human identity, which demands significant time and resources at deployment and impedes the scalability and overall application potential of the system. In this paper, we tackle this problem by proposing a set of learnable identity codes that extend the framework to multi-identity free-viewpoint rendering, together with an effective pose-conditioned code query mechanism that finely models pose-dependent non-rigid motions. YOTO optimizes neural radiance fields (NeRF), using the identity codes to condition the model so that it learns the various canonical T-pose appearances in a single shared volumetric representation. In addition, jointly learning multiple identities within a unified model enables, as a by-product, flexible motion transfer with high-quality photo-realistic rendering for all learned appearances. This capability broadens its potential use in important applications such as Virtual Reality. We present extensive experimental results on ZJU-MoCap and PeopleSnapshot that demonstrate the effectiveness of the proposed model. YOTO achieves state-of-the-art performance on all evaluation metrics, with significant benefits in training and inference efficiency as well as rendering quality. The code and model will be made publicly available soon.
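To make the idea of identity-code conditioning concrete, the following is a minimal sketch rather than the YOTO implementation. It assumes a single small MLP shared by all identities, whose input is the positional encoding of a canonical-space point concatenated with a per-identity code. The class name `SharedCanonicalField`, the layer sizes, the code dimension, and the random (untrained) weights are all hypothetical choices for illustration; the pose-conditioned code query mechanism is not modeled here.

```python
import math
import random


def positional_encoding(point, num_freqs=4):
    """NeRF-style sinusoidal encoding of a 3D point (raw coordinates kept)."""
    encoded = list(point)
    for k in range(num_freqs):
        for value in point:
            encoded.append(math.sin((2 ** k) * value))
            encoded.append(math.cos((2 ** k) * value))
    return encoded


class SharedCanonicalField:
    """Hypothetical shared field: one MLP, one code vector per identity."""

    def __init__(self, num_identities, code_dim=8, hidden=16, num_freqs=4, seed=0):
        rng = random.Random(seed)
        self.num_freqs = num_freqs
        # One code per identity; in training these would be optimized
        # jointly with the MLP weights (assumption, not shown here).
        self.codes = [[rng.gauss(0.0, 1.0) for _ in range(code_dim)]
                      for _ in range(num_identities)]
        in_dim = 3 + 3 * 2 * num_freqs + code_dim
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden)]
        self.w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden)]
                   for _ in range(4)]  # 3 color channels + 1 density

    def query(self, identity, canonical_point):
        """Return (rgb, sigma) at a canonical-space point for one identity."""
        features = positional_encoding(canonical_point, self.num_freqs)
        features = features + self.codes[identity]  # condition on identity
        hidden = [max(0.0, sum(w * f for w, f in zip(row, features)))
                  for row in self.w1]  # ReLU layer
        out = [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]
        rgb = [1.0 / (1.0 + math.exp(-o)) for o in out[:3]]  # sigmoid color
        sigma = max(0.0, out[3])  # non-negative density
        return rgb, sigma


field = SharedCanonicalField(num_identities=3)
rgb, sigma = field.query(identity=1, canonical_point=(0.1, -0.2, 0.3))
```

Under this sketch, rendering a different identity means changing only the `identity` index passed to `query`; the shared weights are untouched, which is what lets one trained model serve several subjects.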
