Recently, regression-based methods have dominated the field of 3D human pose and shape estimation. Despite their promising results, a common issue is the misalignment between predictions and image observations, often caused by minor joint rotation errors that accumulate along the kinematic chain. To address this issue, we propose constructing dense correspondences between initial human model estimates and the corresponding images, which can then be used to refine the initial predictions. To this end, we use renderings of the 3D models to predict per-pixel 2D displacements between the synthetic renderings and the RGB images. This allows us to effectively integrate and exploit appearance information of the people in the images. Our per-pixel displacements can be efficiently transformed into per-visible-vertex displacements and then used for 3D model refinement by minimizing a reprojection loss. To demonstrate the effectiveness of our approach, we refine the initial 3D human mesh predictions of multiple models using different refinement procedures on 3DPW and RICH. We show that our approach not only consistently leads to better image-model alignment, but also improves 3D accuracy.
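As a rough illustration of the refinement idea, the following minimal sketch samples a per-pixel displacement field at the projected positions of the visible vertices and then minimizes a squared reprojection error. It rests on several assumptions that are not taken from the method itself: a pinhole camera, nearest-pixel sampling, a displacement field that points from the rendering towards the RGB image, a visibility mask that is already available, and a single global 3D translation as the only refined quantity (a stand-in for refining body model parameters). The solver is plain Gauss-Newton, and all function names are hypothetical.

```python
import numpy as np


def project(vertices, focal, center):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixel coordinates."""
    return focal * vertices[:, :2] / vertices[:, 2:3] + center


def sample_displacements(flow, pixels):
    """Read the (H, W, 2) displacement field at the pixel nearest to each (x, y) position."""
    height, width = flow.shape[:2]
    cols = np.clip(np.rint(pixels[:, 0]).astype(int), 0, width - 1)
    rows = np.clip(np.rint(pixels[:, 1]).astype(int), 0, height - 1)
    return flow[rows, cols]


def refine_translation(vertices, targets, focal, center, iterations=10):
    """Gauss-Newton estimate of a 3D offset t that minimizes the squared
    reprojection error between project(vertices + t) and the 2D targets."""
    t = np.zeros(3)
    for _ in range(iterations):
        moved = vertices + t
        x, y, z = moved[:, 0], moved[:, 1], moved[:, 2]
        residual = project(moved, focal, center) - targets  # (N, 2)

        # Jacobian of the projected pixel (u, v) with respect to t.
        jac = np.zeros((len(moved), 2, 3))
        jac[:, 0, 0] = focal / z
        jac[:, 0, 2] = -focal * x / z**2
        jac[:, 1, 1] = focal / z
        jac[:, 1, 2] = -focal * y / z**2

        # Least-squares solution of the linearized problem.
        step = np.linalg.lstsq(jac.reshape(-1, 3), residual.reshape(-1), rcond=None)[0]
        t -= step
    return t


# Synthetic example: 50 vertices at depth 5, every fifth one marked as not visible.
rng = np.random.default_rng(0)
n = 50
vertices = np.column_stack(
    [rng.uniform(-0.5, 0.5, n), rng.uniform(-1.0, 1.0, n), np.full(n, 5.0)]
)
visible = np.ones(n, dtype=bool)
visible[::5] = False

focal, center = 500.0, np.array([320.0, 240.0])

# Hypothetical predicted displacement field: a constant shift of (3, -2) pixels.
flow = np.zeros((480, 640, 2))
flow[..., 0] = 3.0
flow[..., 1] = -2.0

pixels = project(vertices[visible], focal, center)
targets = pixels + sample_displacements(flow, pixels)  # per-visible-vertex 2D targets
offset = refine_translation(vertices[visible], targets, focal, center)
refined_pixels = project(vertices[visible] + offset, focal, center)
```

In this toy setup all vertices share the same depth, so a constant pixel shift corresponds to a pure in-plane translation. With a real body model, the same per-vertex 2D targets would instead drive an optimization over pose, shape, and camera parameters.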