Expressive Human Pose and Shape Estimation (EHPS) plays a crucial role in many AR/VR applications and has seen significant progress in recent years. However, current state-of-the-art methods still struggle to estimate face and hand parameters accurately and generalize poorly to in-the-wild images. To address these challenges, we present CoEvoer, a novel one-stage synergistic cross-dependency transformer framework tailored for upper-body EHPS. CoEvoer enables explicit feature-level interaction across body parts, so that each part is enhanced by contextual information from the others. Specifically, larger and more easily estimated regions such as the torso provide global semantics and positional priors that guide the estimation of finer, more complex regions such as the face and hands. Conversely, the localized details captured in the face and hand regions help refine and calibrate adjacent body parts. To the best of our knowledge, CoEvoer is the first framework designed specifically for upper-body EHPS, with the goal of capturing the strong coupling and semantic dependencies among the face, hands, and torso through joint parameter regression. Extensive experiments demonstrate that CoEvoer achieves state-of-the-art performance on upper-body benchmarks and generalizes well to unseen in-the-wild images.
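The bidirectional exchange described above, in which torso features guide the face and hands while face and hand features refine the torso, can be illustrated with a minimal cross-attention sketch. This is not the CoEvoer implementation: the function names (`cross_attend`, `exchange`), the token layout, and the use of plain residual dot-product attention without learned projections are all assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Residual scaled dot-product cross-attention (hypothetical helper).

    Each query token is updated by adding an attention-weighted mix of the
    context tokens. Learned query/key/value projections are omitted for
    brevity; a real transformer layer would include them.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        mixed = [sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)]
        updated.append([qi + mi for qi, mi in zip(q, mixed)])
    return updated


def exchange(torso, face, hands):
    """One round of part-to-part interaction (illustrative assumption).

    Coarse-to-fine: face and hand tokens query the torso tokens.
    Fine-to-coarse: torso tokens query the face and hand tokens.
    All three updates read the original inputs, so the order of the
    calls does not matter.
    """
    face_out = cross_attend(face, torso)
    hands_out = cross_attend(hands, torso)
    torso_out = cross_attend(torso, face + hands)
    return torso_out, face_out, hands_out
```

Each part keeps its own token count and feature dimension after the exchange, so per-part regression heads (not shown) could consume the updated tokens directly.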