Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and spatial location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person centers, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and spatial location using a new cross-attention module called the Human Prediction Head (HPH), with one query per detected center token, attending to the entire set of features. As direct prediction of SMPL-X parameters yields suboptimal results, we introduce CUFFS, the Close-Up Frames of Full-Body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating this dataset into training further enhances predictions, particularly for hands, enabling us to achieve state-of-the-art performance. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously. We train models with various backbone sizes and input resolutions. In particular, using a ViT-S backbone and $448\times448$ input images already yields a fast and competitive model with respect to state-of-the-art methods, while larger models and higher resolutions further improve performance.
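The pipeline described above (center detection on a coarse heatmap, one query per detected token cross-attending to all image tokens, and per-token camera rays) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the 0.5 threshold, the patch size of 14, the feature width of 64, the random weights and the intrinsics matrix `K` are all assumptions chosen for illustration, and the step that turns attended features into SMPL-X parameters is omitted.

```python
import numpy as np

def detect_centers(heatmap, threshold=0.5):
    """Return (row, col) indices of patches scoring above threshold (assumed rule)."""
    return np.argwhere(heatmap > threshold)

def ray_directions(grid_h, grid_w, patch, K):
    """Unit camera ray through the center of each patch, given 3x3 intrinsics K."""
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    u = (xs + 0.5) * patch  # pixel x of each patch center
    v = (ys + 0.5) * patch  # pixel y of each patch center
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T  # back-project: K^{-1} [u, v, 1]^T
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features, Wq, Wk, Wv):
    """Single-head cross-attention: each query attends to every image token."""
    q, k, v = queries @ Wq, features @ Wk, features @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v, attn

rng = np.random.default_rng(0)
grid, patch, dim = 32, 14, 64  # 448 / 14 = 32 tokens per side (patch size assumed)
features = rng.normal(size=(grid * grid, dim))  # stand-in for ViT token features

heatmap = np.zeros((grid, grid))  # stand-in for the predicted center heatmap
heatmap[5, 7] = 0.9
heatmap[20, 12] = 0.8

centers = detect_centers(heatmap)
token_idx = centers[:, 0] * grid + centers[:, 1]
queries = features[token_idx]  # one query per detected center token

Wq, Wk, Wv = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
person_feats, attn = cross_attention(queries, features, Wq, Wk, Wv)

K = np.array([[500.0, 0.0, 224.0], [0.0, 500.0, 224.0], [0.0, 0.0, 1.0]])  # made-up intrinsics
rays = ray_directions(grid, grid, patch, K)
```

Here `person_feats` holds one feature vector per detected person. How the ray directions are fused with the image tokens is not specified in the abstract, so the sketch only computes them.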
