One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer

Whole-body mesh recovery aims to estimate 3D human body, face, and hand parameters from a single image. Performing this task with a single network is challenging because of resolution issues: the face and hands usually occupy extremely small regions of the image. Existing works typically detect the hands and face, enlarge the cropped regions, feed each to a part-specific network to predict its parameters, and finally fuse the results. While this copy-paste pipeline can capture the fine-grained details of the face and hands, the connections between different parts cannot be easily recovered in late fusion, leading to implausible 3D rotations and unnatural poses. In this work, we propose a one-stage pipeline for expressive whole-body mesh recovery, named OSX, without separate networks for each part. Specifically, we design a Component Aware Transformer (CAT) composed of a global body encoder and a local face/hand decoder. The encoder predicts the body parameters and provides a high-quality feature map for the decoder, which performs a feature-level upsample-crop scheme to extract high-resolution part-specific features and adopts keypoint-guided deformable attention to estimate the hands and face precisely. The whole pipeline is simple yet effective, requires no manual post-processing, and naturally avoids implausible predictions. Comprehensive experiments demonstrate the effectiveness of OSX. Lastly, we build a large-scale Upper-Body dataset (UBody) with high-quality 2D and 3D whole-body annotations. It contains persons with partially visible bodies in diverse real-life scenarios to bridge the gap between the basic task and downstream applications.
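To make the feature-level upsample-crop idea concrete, the following is a minimal sketch in plain Python, not the OSX implementation. It assumes a single-channel feature map stored as nested lists, nearest-neighbour upsampling, and a part box that is already known in low-resolution feature coordinates; the abstract does not specify the upsampling operator or how the boxes are obtained. All names (`upsample_nearest`, `crop`, `upsample_crop`, `hand_box`) are hypothetical.

```python
from typing import List, Tuple

# Assumption: a single-channel H x W feature map, kept tiny for illustration.
FeatureMap = List[List[float]]
# Assumption: (x0, y0, x1, y1) in low-resolution feature coordinates,
# with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def upsample_nearest(feat: FeatureMap, scale: int) -> FeatureMap:
    """Nearest-neighbour upsampling: each cell becomes a scale x scale block."""
    out: FeatureMap = []
    for row in feat:
        wide = [value for value in row for _ in range(scale)]
        # Copy the widened row so the repeated rows do not share one list.
        out.extend(list(wide) for _ in range(scale))
    return out


def crop(feat: FeatureMap, box: Box) -> FeatureMap:
    """Cut the rectangle given by box out of the feature map."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in feat[y0:y1]]


def upsample_crop(feat: FeatureMap, box: Box, scale: int) -> FeatureMap:
    """Upsample the shared feature map, then crop the part region from it.

    The box is given at low resolution, so it is rescaled to match the
    upsampled map before cropping.
    """
    high_res = upsample_nearest(feat, scale)
    x0, y0, x1, y1 = box
    return crop(high_res, (x0 * scale, y0 * scale, x1 * scale, y1 * scale))


# A 3 x 3 "body" feature map and a hypothetical hand box covering its
# top-right 2 x 2 cells.
body_feat = [
    [0.0, 1.0, 2.0],
    [3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0],
]
hand_box = (1, 0, 3, 2)
hand_feat = upsample_crop(body_feat, hand_box, scale=2)

for row in hand_feat:
    print(row)
```

The point of the sketch is that the part-specific features come from the same feature map the body branch uses, rather than from a separately cropped and re-encoded image patch. The 2 x 2 low-resolution region becomes a 4 x 4 part feature map here.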
