We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model, together with spatial location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person centers, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and spatial location using a new cross-attention module called the Human Prediction Head (HPH), with one query per detected center token, attending to the entire set of features. As direct prediction of SMPL-X parameters yields suboptimal results, we introduce CUFFS, the Close-Up Frames of Full-Body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating this dataset into training further enhances predictions, particularly for hands, enabling us to achieve state-of-the-art performance. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously. We train models with various backbone sizes and input resolutions. In particular, using a ViT-S backbone and $448\times448$ input images already yields a fast and competitive model with respect to state-of-the-art methods, while larger models and higher resolutions further improve performance.
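To make the camera-intrinsics conditioning concrete, the following is a minimal sketch of how one ray direction per image token could be computed under a pinhole camera model. It is an illustration, not the model's actual implementation: the function name `patch_ray_directions`, the patch size of 14 pixels, and the example focal length are assumptions, and how the resulting directions are encoded and combined with the ViT features is not specified here.

```python
import math


def patch_ray_directions(fx, fy, cx, cy, image_size=448, patch_size=14):
    """Return a grid of unit ray directions, one per image patch (token).

    Hypothetical helper: each ray passes through the camera center and the
    pixel at the center of a patch, assuming a pinhole camera with focal
    lengths (fx, fy) and principal point (cx, cy), all in pixels.
    The patch size of 14 is an assumption, not taken from the text above.
    """
    n = image_size // patch_size  # number of tokens per side
    rays = []
    for row in range(n):
        ray_row = []
        for col in range(n):
            # Pixel coordinates of the patch center.
            u = (col + 0.5) * patch_size
            v = (row + 0.5) * patch_size
            # Back-project through the inverse intrinsics: K^{-1} [u, v, 1]^T.
            x = (u - cx) / fx
            y = (v - cy) / fy
            # Normalize to a unit direction in the camera coordinate system.
            norm = math.sqrt(x * x + y * y + 1.0)
            ray_row.append((x / norm, y / norm, 1.0 / norm))
        rays.append(ray_row)
    return rays


# Example with assumed intrinsics: principal point at the image center.
rays = patch_ray_directions(fx=500.0, fy=500.0, cx=224.0, cy=224.0)
```

With a $448\times448$ input and the assumed patch size, this gives a $32\times32$ grid of unit vectors. When intrinsics are unavailable, a reasonable fallback would be to skip this conditioning altogether, matching the optional nature of the feature described above.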