Contrastive Language-Image Pre-training (CLIP) has begun to be applied to many computer vision tasks and has achieved promising performance. However, it remains underexplored whether CLIP can be generalized to 3D hand pose estimation, as bridging text prompts with pose-aware features presents significant challenges due to the discrete nature of joint positions in 3D space. In this paper, we make one of the first attempts to propose a novel 3D hand pose estimator from monocular images, dubbed CLIP-Hand3D, which successfully bridges the gap between text prompts and the irregular, detailed pose distribution. In particular, the distribution order of hand joints along each direction of 3D space is derived from pose labels and used to form corresponding text prompts, which are subsequently encoded into text representations. Simultaneously, 21 hand joints in 3D space are retrieved, and their spatial distribution (along the x, y, and z axes) is encoded to form pose-aware features. We then maximize the semantic consistency of each pose-text feature pair following a CLIP-based contrastive learning paradigm. Furthermore, a coarse-to-fine mesh regressor is designed, which is capable of effectively querying joint-aware cues from the feature pyramid. Extensive experiments on several public hand benchmarks show that the proposed model attains a significantly faster inference speed while achieving state-of-the-art performance compared to methods that use a backbone of similar scale.
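To make the pose-to-text idea concrete, the following is a minimal NumPy sketch of the two steps the abstract describes: turning the order of the 21 joints along one axis into a text prompt, and scoring pose-text feature pairs with a CLIP-style symmetric contrastive loss. The joint names, the prompt template, the temperature value, and the function names `axis_order_prompt` and `clip_contrastive_loss` are all assumptions made for illustration; the abstract does not specify them, and the actual model uses learned encoders rather than raw feature arrays.

```python
import numpy as np

# Hypothetical names for a 21-joint hand skeleton (1 wrist + 5 fingers x 4 joints).
# The paper's actual joint naming and prompt wording are not given in the abstract.
FINGERS = ["thumb", "index", "middle", "ring", "pinky"]
JOINT_NAMES = ["wrist"] + [f"{f}_{k}" for f in FINGERS for k in range(1, 5)]


def axis_order_prompt(joints, axis, names=JOINT_NAMES):
    """Describe the order of joints along one axis (0=x, 1=y, 2=z) as text.

    joints: array of shape (21, 3) holding 3D joint positions (pose labels).
    """
    order = np.argsort(joints[:, axis], kind="stable")  # ascending coordinate
    listed = ", ".join(names[i] for i in order)
    return f"Joints ordered along {'xyz'[axis]} from smallest to largest: {listed}"


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(pose_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched pose/text features.

    Row i of pose_feats is assumed to be paired with row i of text_feats.
    The temperature of 0.07 is an assumed value, not taken from the paper.
    """
    p = pose_feats / np.linalg.norm(pose_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = p @ t.T / temperature            # (batch, batch) cosine similarities
    idx = np.arange(logits.shape[0])          # matched pairs lie on the diagonal
    loss_pose_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_pose = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_pose_to_text + loss_text_to_pose)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    joints = rng.normal(size=(21, 3))         # stand-in for one pose label
    for axis in range(3):
        print(axis_order_prompt(joints, axis))

    feats = rng.normal(size=(4, 8))           # stand-in for encoder outputs
    print(clip_contrastive_loss(feats, feats))
```

In this sketch, minimizing the loss pulls each pose feature toward the text feature of its own ordering prompt and pushes it away from the prompts of other samples in the batch, which is the sense in which semantic consistency of a pose-text pair is maximized.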